
App store mining is not enough for app improvement


Abstract

The rise in popularity of mobile devices has led to a parallel growth of the app store market, prompting numerous research studies and commercial platforms that mine app stores. App store reviews are used to analyze various aspects of app development and evolution. However, app users’ feedback is not limited to the app store. In fact, despite the large number of posts made daily on social media, the value of these discussions remains largely untapped in the context of mobile app development. In this paper, we study how Twitter can provide complementary information to support mobile app development. By analyzing a total of 30,793 apps over a period of six weeks, we found strong correlations between the number of reviews and the number of tweets for most apps. Moreover, by applying machine learning classifiers, topic modeling, and subsequent crowdsourcing, we successfully mined 22.4% additional feature requests and 12.89% additional bug reports from Twitter. We also found that 52.1% of all feature requests and bug reports were discussed in both tweets and reviews. In addition to identifying information common to and unique to Twitter and the app store, we performed sentiment and content analysis for 70 randomly selected apps. We found that tweets provided more critical and objective views on apps than app store reviews. These results show that app store review mining is indeed not enough: other information sources ultimately provide added value and information for app developers.






Acknowledgments

We would like to thank Homayoon Farrahi and Ada Lee for their help with this study. We thank all the anonymous reviewers and the associate editor for their valuable comments and suggestions. This research was partially supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grant 250343-12.

Author information


Corresponding author

Correspondence to Maleknaz Nayebi.

Additional information

Communicated by: Yasutaka Kamei

Appendix: Crowdsourced evaluation of RQ2 and RQ3

In Section 5 we discussed why and how we used crowdsourcing:

First, we used crowdsourcing to validate the results of the similarity analysis (cosine similarity) between tweet topics and review topics, that is, to confirm (i) whether assigning a tweet topic to a review topic was correct and (ii) whether we missed assigning a tweet topic to a review topic (RQ2). A sketch of this similarity computation follows below.

Second, we used crowdsourcing to compare the degree of specification and understandability of tweets and reviews (RQ3).
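
For illustration, the following is a minimal sketch of how the cosine similarity between a tweet topic and a review topic can be computed when each topic is represented as a bag of word weights (e.g., LDA word probabilities). The topics, weights, and function below are hypothetical and are not the paper’s implementation.

    # Minimal sketch (not the paper's implementation): cosine similarity between
    # two topics represented as word -> weight dictionaries.
    import math

    def cosine_similarity(topic_a, topic_b):
        vocab = set(topic_a) | set(topic_b)
        dot = sum(topic_a.get(w, 0.0) * topic_b.get(w, 0.0) for w in vocab)
        norm_a = math.sqrt(sum(v * v for v in topic_a.values()))
        norm_b = math.sqrt(sum(v * v for v in topic_b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    # Hypothetical topic-word weights, for illustration only.
    tweet_topic = {"crash": 0.4, "login": 0.3, "update": 0.3}
    review_topic = {"crash": 0.5, "freeze": 0.3, "update": 0.2}
    print(round(cosine_similarity(tweet_topic, review_topic), 3))

Topic pairs whose similarity exceeds a chosen threshold would be candidates for assignment; the crowd then confirms or rejects such assignments.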

“The crowd” is composed of workers who are not known in person to the authors. To assess the validity of the crowdsourced results, we hired three developers known to the authors from similar former work and asked them to perform the same tasks as the crowd. Overall, across RQ2 and RQ3 and across the different tasks, the average Fleiss’ kappa among the three developers was 0.84, which indicates almost perfect agreement. We then compared the results achieved by the crowd with the results achieved by the three developers.
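
For reference, a minimal sketch of computing Fleiss’ kappa from an items-by-categories count matrix is shown below; the ratings are made up for illustration, and this is not necessarily the tooling used in the study.

    # Minimal sketch of Fleiss' kappa with a fixed number of raters per item.
    # counts[i][j] = number of raters who put item i into category j.
    def fleiss_kappa(counts):
        n_items = len(counts)
        n_raters = sum(counts[0])            # raters per item (constant)
        n_cats = len(counts[0])
        # Per-item agreement P_i and overall category proportions p_j.
        p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
               for row in counts]
        p_j = [sum(row[j] for row in counts) / (n_items * n_raters)
               for j in range(n_cats)]
        p_bar = sum(p_i) / n_items           # observed agreement
        p_e = sum(p * p for p in p_j)        # chance agreement
        return (p_bar - p_e) / (1 - p_e)

    # Hypothetical example: 4 items rated by 3 developers into 2 categories.
    print(round(fleiss_kappa([[3, 0], [2, 1], [3, 0], [0, 3]]), 3))  # 0.625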

Evaluating crowdsourced results in RQ2

We randomly selected 500 pairs of tweet and review topics. Among them, 250 pairs had been marked as similar and 250 as different by the crowd. We asked three app developers with whom we had formerly worked to manually label these 500 pairs. They answered the same questions as the crowd (Fig. 5). We compared the results and found that the developers classified 99.4% of the pairs in the same way as we did based on the crowd’s evaluation.

We then compared the results achieved by the crowd with the results achieved by the three developers, as presented in Table 3.

Table 3 Comparison of crowdsourced results with the results from known developers for RQ2

Evaluating crowdsourced results in RQ3

We randomly selected 250 tweets and 250 reviews and asked the three developers to judge both the degree of specification and the degree of understandability. Tables 4 and 5 compare the results of this task as performed by the crowd and by the three developers.

17.2% of the reviews and 16.4% of the tweets were classified into a different specification category by the developers than by the crowd. With the same setup, we asked the developers to evaluate the degree of understandability and compared the results with those received from the crowd:

11.2% of the reviews and 17.6% of the tweets were classified into a different understandability category by the developers than by the crowd.

Table 4 Comparison of crowdsourced results with the results from known developers in RQ3 for degree of specification
Table 5 Comparison of crowdsourced results with the results from known developers in RQ3 for degree of understandability
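
The disagreement percentages above can be obtained by directly comparing the two sets of labels. The sketch below illustrates the computation with hypothetical category names and labels; it does not use the study’s data.

    # Minimal sketch: fraction of items the developers placed in a different
    # category than the crowd (hypothetical labels, not the study's data).
    def disagreement_rate(crowd_labels, developer_labels):
        pairs = list(zip(crowd_labels, developer_labels))
        return sum(1 for c, d in pairs if c != d) / len(pairs)

    crowd = ["specific", "vague", "specific", "specific", "vague"]
    developers = ["specific", "specific", "specific", "vague", "vague"]
    print(f"{disagreement_rate(crowd, developers):.1%}")  # prints 40.0%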


Cite this article

Nayebi, M., Cho, H. & Ruhe, G. App store mining is not enough for app improvement. Empir Software Eng 23, 2764–2794 (2018). https://doi.org/10.1007/s10664-018-9601-1
