Abstract
One of the main challenges faced by the Brazilian Office of the Comptroller General (CGU) is applying consistent knowledge discovery tools and methodologies to learn from several years of auditing experience, recorded in the hundreds of thousands of audit reports, totaling millions of pages, that it has produced over those years. More specifically, we tackle the problem of identifying the most common topics in Information Technology audits performed in Brazil since 2011. To do so, we compare two approaches: supervised and unsupervised learning. The supervised approach produced a random forest model that achieved around 73% accuracy across seven categories. The unsupervised approach, using Latent Dirichlet Allocation (LDA), produced a model with five topics, which the subject matter experts (SMEs) from CGU judged to be the best model in their validation. Notably, although the two approaches were implemented independently, they generated very similar topics. This agreement reinforces that consistent, well-known knowledge discovery methods succeeded in identifying the main problems found during these years of IT auditing at CGU.
Acknowledgments
The authors would like to thank the Federal Internal Control Secretariat (SFC, in Portuguese), especially the Coordination of Control and Core Audit of Information Technology (GSNTI, in Portuguese), for the partnership during this work. Finally, the authors thank the Director of the Department of Research and Strategy Information, Gilson Libório, and CGU for their support and for allowing the publication of this work.
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Maia, P., Sales, L., Carvalho, R.N. (2016). Identifying the Main Problems in IT Auditing: A Comparison Between Unsupervised and Supervised Learning. In: Kő, A., Francesconi, E. (eds) Electronic Government and the Information Systems Perspective. EGOVIS 2016. Lecture Notes in Computer Science, vol 9831. Springer, Cham. https://doi.org/10.1007/978-3-319-44159-7_17
DOI: https://doi.org/10.1007/978-3-319-44159-7_17
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-44158-0
Online ISBN: 978-3-319-44159-7
eBook Packages: Computer Science, Computer Science (R0)