Automated Identification of Sensitive Data via Flexible User Requirements

  • Conference paper
Security and Privacy in Communication Networks (SecureComm 2018)

Abstract

Protecting sensitive data in web and mobile applications requires first identifying that data, which typically takes intensive manual effort. In addition, what counts as sensitive data depends on users’ requirements and the application context. Existing research on identifying sensitive data from its descriptive texts focuses on keyword and phrase searching. These approaches can produce many false positives and false negatives because they do not consider the semantics of the descriptions. In this paper, we propose S3, an automated approach to identifying sensitive data based on user requirements. It combines semantic, syntactic, and lexical information, aiming to identify sensitive data by the semantics of its descriptive texts. We introduce the notion of a concept space to represent the user’s notion of privacy, through which our approach supports flexible user requirements in defining sensitive data. Our approach learns users’ preferences from readable concepts initially provided by users and automatically identifies related sensitive data. We evaluate our approach on over 18,000 top applications from the Google Play Store. S3 achieves an average precision of 89.2% and an average recall of 95.8% in identifying sensitive data.
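
As a rough illustration of the concept-space idea, the sketch below marks a descriptive text as sensitive when any of its words is semantically close to a user-provided concept. It is not the S3 implementation, which also uses syntactic and lexical information. The vectors in `TOY_VECTORS`, the helper names `cosine`, `relatedness`, and `is_sensitive`, and the threshold of 0.9 are all assumptions invented for this example; a real system would presumably use pretrained word embeddings.

```python
import math
import re

# Toy 3-dimensional word vectors, invented purely for illustration.
# They are NOT taken from the paper or from any real embedding model.
TOY_VECTORS = {
    "password": (0.90, 0.10, 0.00),
    "passcode": (0.85, 0.15, 0.05),
    "pin":      (0.80, 0.20, 0.10),
    "salary":   (0.10, 0.90, 0.10),
    "income":   (0.15, 0.85, 0.05),
    "weather":  (0.00, 0.10, 0.90),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def relatedness(text, concepts):
    """Highest similarity between any known word in text and any concept."""
    tokens = [t for t in re.findall(r"[a-z]+", text.lower()) if t in TOY_VECTORS]
    known = [c for c in concepts if c in TOY_VECTORS]
    return max(
        (cosine(TOY_VECTORS[t], TOY_VECTORS[c]) for t in tokens for c in known),
        default=0.0,  # no known words or concepts: treat as unrelated
    )


def is_sensitive(text, concepts, threshold=0.9):
    """Hypothetical decision rule: sensitive if close enough to a concept."""
    return relatedness(text, concepts) >= threshold


if __name__ == "__main__":
    labels = ["Enter your passcode", "Monthly income", "Local weather"]
    for concepts in (["password"], ["password", "salary"]):
        flagged = [label for label in labels if is_sensitive(label, concepts)]
        print(concepts, "->", flagged)
```

With only `password` as a concept, the toy vectors flag the passcode label but not the income label. Adding `salary` to the concepts flags the income label as well, which mirrors how different user requirements lead to different sets of sensitive data.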

Notes

  1. S3 stands for semantics, syntax, and sentiment.

Acknowledgment

This research is supported by the National Research Foundation, Prime Minister’s Office, Singapore, under its National Cybersecurity R&D Programme (Grant No. NRF2015NCR-NCR002-001).

Author information

Corresponding author

Correspondence to Ziqi Yang.

Copyright information

© 2018 ICST Institute for Computer Sciences, Social Informatics and Telecommunications Engineering

About this paper

Cite this paper

Yang, Z., Liang, Z. (2018). Automated Identification of Sensitive Data via Flexible User Requirements. In: Beyah, R., Chang, B., Li, Y., Zhu, S. (eds) Security and Privacy in Communication Networks. SecureComm 2018. Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering, vol 254. Springer, Cham. https://doi.org/10.1007/978-3-030-01701-9_9

  • DOI: https://doi.org/10.1007/978-3-030-01701-9_9

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-01700-2

  • Online ISBN: 978-3-030-01701-9

  • eBook Packages: Computer Science (R0)
