Using frame semantics for classifying and summarizing application store reviews

  • Nishant Jha
  • Anas Mahmoud


Text mining techniques have recently been employed to classify and summarize user reviews on mobile application stores. However, due to the inherently diverse and unstructured nature of user-generated online textual data, text-based review mining techniques often produce excessively complicated models that are prone to overfitting. In this paper, we propose a novel approach, based on frame semantics, for app review mining. Semantic frames help to generalize from raw text (individual words) to more abstract scenarios (contexts). This lower-dimensional representation of text is expected to enhance the predictive capabilities of review mining techniques and reduce the chances of overfitting. Specifically, our analysis in this paper is twofold. First, we investigate the performance of semantic frames in classifying informative user reviews into various categories of actionable software maintenance requests. Second, we propose and evaluate the performance of multiple summarization algorithms in generating concise and representative summaries of informative reviews. Three different datasets of app store reviews, sampled from a broad range of application domains, are used to conduct our experimental analysis. The results show that semantic frames can enable an efficient and accurate review classification process. However, in review summarization tasks, our results show that text-based summarization generates more comprehensive summaries than frame-based summarization. Finally, we introduce MARC 2.0, a review classification and summarization suite that implements the algorithms investigated in our analysis.
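To make the idea of generalizing from individual words to frames concrete, the following is a minimal sketch only, not the approach implemented in the paper. The word-to-frame lexicon, the frame labels, the sample reviews, and the helper names `bag_of_words` and `bag_of_frames` are all hypothetical and chosen for illustration; a real pipeline would presumably rely on a frame-semantic parser over a resource such as FrameNet rather than a hand-written dictionary.

```python
from collections import Counter

# Hypothetical toy lexicon mapping words to frame labels.
# Illustrative only: not taken from FrameNet or from the paper's data.
TOY_FRAME_LEXICON = {
    "crashes": "Malfunction",
    "freezes": "Malfunction",
    "hangs": "Malfunction",
    "add": "Request",
    "want": "Request",
    "need": "Request",
}


def bag_of_words(review):
    """Count every lowercase whitespace-separated token in the review."""
    return Counter(review.lower().split())


def bag_of_frames(review):
    """Count only the frames evoked by tokens found in the toy lexicon."""
    tokens = review.lower().split()
    return Counter(TOY_FRAME_LEXICON[t] for t in tokens if t in TOY_FRAME_LEXICON)


# Made-up sample reviews, used only to compare the two representations.
reviews = [
    "app crashes on startup",
    "it freezes every time",
    "please add dark mode",
    "i want offline maps",
]

# Feature vocabulary under each representation.
word_vocab = set().union(*(bag_of_words(r) for r in reviews))
frame_vocab = set().union(*(bag_of_frames(r) for r in reviews))

print(len(word_vocab), "word features vs", len(frame_vocab), "frame features")
```

In this toy setting the first two reviews share no words, yet both reduce to the same single frame feature, which is the kind of generalization the abstract describes. How well that holds on real reviews depends on the parser and the data.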


Keywords: Requirements elicitation, Application store, Classification, Summarization, FrameNet, Frame semantics



This work was supported in part by the Louisiana Board of Regents Research Competitiveness Subprogram (LA BoR-RCS), contract number: LEQSF(2015-18)-RD-A-07.



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Division of Computer Science and Engineering, Louisiana State University, Baton Rouge, USA
