Identifying the Main Problems in IT Auditing: A Comparison Between Unsupervised and Supervised Learning

  • Conference paper
Electronic Government and the Information Systems Perspective (EGOVIS 2016)

Abstract

One of the main challenges faced by the Brazilian Office of the Comptroller General (CGU) is applying consistent knowledge discovery tools and methodologies to learn from several years of auditing experience, captured in the hundreds of thousands of audit reports, totaling millions of pages, that it produced over those years. More specifically, we tackle the problem of identifying the most common topics in Information Technology audits performed in Brazil since 2011. To do so, we compare two approaches: supervised and unsupervised learning. The supervised approach, using random forest, generated a model that achieved around 73% accuracy across seven categories. The unsupervised approach, using Latent Dirichlet Allocation (LDA), generated a model with five topics, which the subject matter experts (SMEs) from CGU judged to be the best model in their validation. Notably, both approaches, although implemented independently, generated very similar topics. This agreement reinforces that consistent, well-known knowledge discovery methods succeeded in identifying the main problems found during these years of IT auditing at CGU.
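
The abstract reports that the supervised categories and the LDA topics turned out very similar, a judgment that rested on validation by CGU's subject matter experts. As a rough illustration only, one way such agreement could be quantified is the Jaccard overlap between the top terms of each supervised category and each LDA topic. This is not the authors' procedure: the function names, category names, topic names, and term lists below are all hypothetical and are not taken from the paper.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of terms."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def best_matches(supervised_terms, lda_terms):
    """For each supervised category, find the most similar LDA topic."""
    matches = {}
    for category, cat_terms in supervised_terms.items():
        # Score every topic against this category; ties break on topic name.
        scored = [
            (jaccard(cat_terms, topic_terms), topic)
            for topic, topic_terms in lda_terms.items()
        ]
        score, topic = max(scored)
        matches[category] = (topic, score)
    return matches


# Hypothetical top terms, invented for illustration (not from the paper).
supervised_terms = {
    "contract_management": ["contract", "supplier", "payment", "penalty"],
    "it_governance": ["planning", "committee", "strategy", "policy"],
}
lda_terms = {
    "topic_1": ["contract", "payment", "supplier", "invoice"],
    "topic_2": ["strategy", "planning", "committee", "risk"],
}

matches = best_matches(supervised_terms, lda_terms)
for category, (topic, score) in matches.items():
    print(f"{category} -> {topic} (Jaccard = {score:.2f})")
```

A score near 1 would indicate that a category and a topic share most of their top terms. A measure like this could complement expert review but does not replace it, since term overlap ignores term weights and meaning.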

Notes

  1. http://cran.r-project.org/web/packages/tm/index.html

  2. https://github.com/cpsievert/LDAvis

Acknowledgments

The authors would like to thank the Federal Internal Control Secretariat (SFC, in Portuguese), especially the Coordination of Control and Core Audit of Information Technology (GSNTI, in Portuguese), for the partnership during this work. Finally, the authors thank the Director of the Department of Research and Strategy Information, Gilson Libório, and CGU for their support and for allowing the publication of this work.

Author information

Corresponding author

Correspondence to Patrícia Maia.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Maia, P., Sales, L., Carvalho, R.N. (2016). Identifying the Main Problems in IT Auditing: A Comparison Between Unsupervised and Supervised Learning. In: Kő, A., Francesconi, E. (eds) Electronic Government and the Information Systems Perspective. EGOVIS 2016. Lecture Notes in Computer Science, vol 9831. Springer, Cham. https://doi.org/10.1007/978-3-319-44159-7_17

  • DOI: https://doi.org/10.1007/978-3-319-44159-7_17

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-44158-0

  • Online ISBN: 978-3-319-44159-7

  • eBook Packages: Computer Science, Computer Science (R0)
