Method of Deep Web Collection for Mobile Application Store Based on Category Keyword Searching

  • Guosheng Xu
  • Zhimin Wu
  • Chengze Li
  • Jinghua Yan
  • Jing Yuan
  • Zhiyong Wang
  • Lu Wang
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11611)


With the rapid development of the mobile Internet, the field has entered the era of big data. Demand for data analysis of mobile applications keeps growing, which raises the requirements for mobile application information collection. Because the number of applications is so large, almost all third-party app stores display only a small fraction of them, and most of the information is hidden in the Deep Web database behind the query form. Existing crawler strategies cannot meet this demand. To address the problem, this paper proposes a collection method based on category keyword queries that improves the crawl rate and completeness of app store information collection. First, a vertical crawler collects information from the application interfaces covering the various application categories. Next, the TF-IDF algorithm extracts keywords that represent each category from the application names and descriptions. Finally, incremental crawling is performed by submitting these keywords as queries. Results show that the method effectively improves information completeness and acquisition efficiency.
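The three steps in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the TF-IDF variant, the tokenizer, or the crawler interface, so the function names (`category_keywords`, `incremental_crawl`), the sample seed data, and the in-memory `HIDDEN_INDEX` standing in for a store's search form are all hypothetical. The sketch assumes each category's names and descriptions are pooled into one document before scoring.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumption: simple lowercase word tokenizer; the paper's tokenizer is not specified.
    return re.findall(r"[a-z]+", text.lower())


def category_keywords(docs_by_category, top_k=2):
    """Pool each category's texts into one document and rank its terms by TF-IDF."""
    counts = {
        cat: Counter(tok for doc in docs for tok in tokenize(doc))
        for cat, docs in docs_by_category.items()
    }
    n_categories = len(counts)
    doc_freq = Counter()
    for term_counts in counts.values():
        doc_freq.update(term_counts.keys())  # one vote per category containing the term

    keywords = {}
    for cat, term_counts in counts.items():
        total = sum(term_counts.values())
        scores = {
            term: (c / total) * math.log(n_categories / doc_freq[term])
            for term, c in term_counts.items()
        }
        # Highest score first; ties broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        keywords[cat] = ranked[:top_k]
    return keywords


def incremental_crawl(keywords, search, seen):
    """Submit each keyword to `search` and keep only app ids not collected before."""
    new_ids = []
    for words in keywords.values():
        for word in words:
            for app_id in search(word):
                if app_id not in seen:
                    seen.add(app_id)
                    new_ids.append(app_id)
    return new_ids


# Hypothetical seed data, as if gathered by the vertical crawler (step 1).
seed = {
    "games": ["Puzzle Quest puzzle game app", "Racing Game fast racing game app"],
    "finance": ["Bank Wallet bank payments app", "Stock Tracker stock and bank app"],
}
# Hypothetical stand-in for the store's query form: keyword -> app ids behind it.
HIDDEN_INDEX = {
    "game": ["g1", "g2", "g3"],
    "puzzle": ["g2", "g4"],
    "bank": ["f1"],
    "stock": ["f1", "f2"],
}

keywords = category_keywords(seed)                        # step 2
already_collected = {"g1", "f1"}
new_ids = incremental_crawl(                              # step 3
    keywords, lambda w: HIDDEN_INDEX.get(w, []), already_collected
)
print(keywords)
print(new_ids)
```

In this toy data the word "app" occurs in both categories, so its inverse document frequency is zero and it is never selected as a keyword. That is the property that makes TF-IDF a plausible way to pick category-specific query terms.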


Deep Web · TF-IDF algorithm · Incremental crawling



This research is supported by the National Key R&D Program of China (No. 2018YFC0806900), the Beijing Engineering Laboratory for Security Emulation & Hacking and Defense of IoV, and the National Secrecy Scientific Research Program of China (No. BMKY2018802-1).



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. School of Cyberspace Security, Beijing University of Posts and Telecommunications, Beijing, China
  2. National Computer Network Emergency Response Technical Team/Coordination Center of China (CNCERT), Beijing, China
