Automatic Anonymization of Printed-Text Document Images
Nowadays, the storage and transmission of some types of documents require the removal of the personal information of the users involved. Automatic text anonymization, or de-identification, is a solution for hiding all sensitive information contained in the documents. Although the problem has been studied mainly for plain printed-text documents, there are no works in which the de-identification task also produces anonymized document images with the same text fonts as those in the original documents. This data augmentation process could be applied to train a system for document image classification. In this paper, we describe an implementation of an automated, modular anonymization system for printed-text document images written in Spanish. A system evaluation performed on a dataset of invoice images shows the viability of our proposal.
Keywords: Document image analysis, Printed-text anonymization, Regular expression, Font classification, Convolutional neural network
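As an illustration of the regular-expression component named in the keywords, the following minimal sketch shows how sensitive fields in recognized Spanish text might be replaced with placeholders. The patterns, the placeholder labels, and the `anonymize` function are assumptions made for this example only; the rules actually used by the system are not given in this section.

```python
import re

# Hypothetical patterns chosen for illustration; they are not the
# rules of the system described in the abstract.
PATTERNS = [
    # E-mail address: local part, "@", domain with at least one dot.
    ("EMAIL", re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")),
    # Spanish national ID (DNI) shape: eight digits followed by a letter.
    ("DNI", re.compile(r"\b\d{8}[A-Za-z]\b")),
    # Nine-digit phone number starting with 6, 7, 8 or 9.
    ("PHONE", re.compile(r"\b[6-9]\d{8}\b")),
]


def anonymize(text: str) -> str:
    """Replace every match of each pattern with a bracketed label."""
    for label, pattern in PATTERNS:
        text = pattern.sub(f"[{label}]", text)
    return text


if __name__ == "__main__":
    sample = "Cliente: Ana, DNI 12345678Z, tel. 612345678, ana@example.com"
    print(anonymize(sample))
```

Note that the personal name in the sample is left untouched: fixed-format fields such as ID numbers suit regular expressions, whereas names would likely need dictionaries or a trained recognizer.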
This work has been funded by the Spanish Ministry of Economy and Competitiveness under project number TIN2014-57458-R.