Automatic Anonymization of Printed-Text Document Images

  • Ángel Sánchez (email author)
  • José F. Vélez
  • Javier Sánchez
  • A. Belén Moreno
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10884)

Abstract

Nowadays, the storage and transmission of some types of documents require the removal of personal information about the users involved. Automatic text anonymization, or de-identification, is a solution for hiding all sensitive information contained in the documents. Although the problem has been studied mainly for plain printed-text documents, there are no works in which the de-identification task also produces anonymized document images with the same text fonts as those in the original documents. This data augmentation process could be applied to train a system for document image classification. In this paper, we describe an implementation of an automated, modular anonymization system for printed-text document images written in Spanish. A system evaluation performed on a dataset of invoice images shows the viability of our proposal.
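As a rough illustration of the regular-expression side of text de-identification named in the keywords, the following minimal Python sketch replaces a few kinds of personal data in already-extracted Spanish text with placeholder tags. It is not the authors' implementation: the pattern set, the `PATTERNS` table, and the `anonymize` function are hypothetical names and rules chosen for this example, and the sketch does not cover OCR, font classification, or rendering the anonymized text back into an image.

```python
import re

# Hypothetical patterns for illustration only; the rules used in the paper are not given here.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "NIF": re.compile(r"\b\d{8}[A-Za-z]\b"),   # assumed ID format: 8 digits plus a letter
    "PHONE": re.compile(r"\b[6789]\d{8}\b"),   # assumed phone format: 9 digits starting with 6-9
}


def anonymize(text: str) -> str:
    """Replace every match of each pattern with a tag naming the kind of data removed."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


sample = "Cliente: 12345678Z, tel. 612345678, correo ana@example.com"
print(anonymize(sample))
```

Tagging each replacement with its category, rather than deleting the match, keeps the layout of the text roughly intact, which matters if the anonymized text is later rendered back into a document image.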

Keywords

Document image analysis; Printed-text anonymization; Regular expression; Font classification; Convolutional neural network

Notes

Acknowledgements

This work has been funded by the Spanish Ministry of Economy and Competitiveness under project number TIN2014-57458-R.

Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  • Ángel Sánchez (1), email author
  • José F. Vélez (1)
  • Javier Sánchez (1)
  • A. Belén Moreno (1)

  1. Rey Juan Carlos University, Móstoles, Spain