Identifying the Main Problems in IT Auditing: A Comparison Between Unsupervised and Supervised Learning

Maia, Patrícia; Sales, Leonardo; Carvalho, Rommel N.

doi:10.1007/978-3-319-44159-7_17

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9831)

Included in the following conference series:

International Conference on Electronic Government and the Information Systems Perspective

Abstract

One of the main challenges faced by the Brazilian Office of the Comptroller General (CGU) is applying consistent knowledge discovery tools and methodologies to learn from several years of auditing experience, recorded in the hundreds of thousands of audit reports, totaling millions of pages, that it produced during these years. More specifically, we tackle the problem of identifying the most common topics in Information Technology audits performed in Brazil since 2011. To do so, we compare two approaches: supervised and unsupervised learning. The supervised approach produced a random forest model that achieved around 73% accuracy over seven categories. The unsupervised approach, using Latent Dirichlet Allocation (LDA), produced a model with five topics, which the subject matter experts (SMEs) from CGU judged to be the best model in their validation. Notably, both approaches, although implemented independently, generated very similar topics. This agreement reinforces the conclusion that consistent, well-known knowledge discovery methods succeeded in identifying the main problems found during these years of IT auditing at CGU.
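The abstract reports that the supervised categories and the LDA topics turned out to be very similar, as validated by subject matter experts. As a rough illustration only, the sketch below shows one hypothetical way such similarity could be quantified: the Jaccard overlap between the top terms of each category and each topic. The function names (`jaccard`, `best_matches`), the category and topic labels, and all term lists are invented for this example and are not taken from the paper, which does not describe using this measure.

```python
# Toy illustration only: every label and term list below is invented,
# not taken from the paper or from CGU audit reports.


def jaccard(a, b):
    """Jaccard similarity between two collections of terms."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # define the similarity of two empty sets as 0
    return len(a & b) / len(a | b)


def best_matches(categories, topics):
    """For each supervised category, find the topic with the highest overlap."""
    matches = {}
    for cat_name, cat_terms in categories.items():
        scored = [
            (jaccard(cat_terms, topic_terms), topic_name)
            for topic_name, topic_terms in topics.items()
        ]
        score, topic_name = max(scored)
        matches[cat_name] = (topic_name, score)
    return matches


# Hypothetical top terms per supervised category.
categories = {
    "contract_management": ["contract", "supplier", "payment", "inspection"],
    "it_governance": ["planning", "committee", "strategy", "governance"],
}

# Hypothetical top terms per LDA topic.
topics = {
    "topic_0": ["contract", "payment", "supplier", "price"],
    "topic_1": ["governance", "planning", "strategy", "risk"],
}

matches = best_matches(categories, topics)
for cat_name, (topic_name, score) in matches.items():
    print(f"{cat_name} -> {topic_name} (Jaccard = {score:.2f})")
```

A high best-match score for every category would suggest that the two independently built models describe the same underlying problems. Such a score would only complement expert validation, which is the method the abstract actually reports.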

Acknowledgments

The authors would like to thank the Federal Internal Control Secretariat (SFC, in Portuguese), especially the Coordination of Control and Core Audit of Information Technology (GSNTI, in Portuguese), for the partnership during this work. Finally, the authors thank the Director of the Department of Research and Strategic Information, Gilson Libório, and CGU for their support and for allowing the publication of this work.

Author information

Authors and Affiliations

Department of Research and Strategic Information, Brazilian Office of the Comptroller General, SAS, Quadra 01, Bloco A, Edifício Darcy Ribeiro, Brasília, Distrito Federal, Brazil
Patrícia Maia, Leonardo Sales & Rommel N. Carvalho
Department of Computer Science, University of Brasília, Campus Universitário Darcy Ribeiro, Brasília, Distrito Federal, Brazil
Rommel N. Carvalho

Corresponding author

Correspondence to Patrícia Maia.

Editor information

Editors and Affiliations

Corvinus University of Budapest, Budapest, Hungary
Andrea Kő
Institute of Legal Information Theory and Techniques, Florence, Italy
Enrico Francesconi

About this paper

Cite this paper

Maia, P., Sales, L., Carvalho, R.N. (2016). Identifying the Main Problems in IT Auditing: A Comparison Between Unsupervised and Supervised Learning. In: Kő, A., Francesconi, E. (eds) Electronic Government and the Information Systems Perspective. EGOVIS 2016. Lecture Notes in Computer Science, vol 9831. Springer, Cham. https://doi.org/10.1007/978-3-319-44159-7_17

DOI: https://doi.org/10.1007/978-3-319-44159-7_17
Published: 07 August 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-44158-0
Online ISBN: 978-3-319-44159-7
eBook Packages: Computer Science, Computer Science (R0)