Digitization Workflow in the Czech Digital Mathematics Library

  • Conference paper
Computer Mathematics

Abstract

This paper describes the experience of building a workflow that turns scanned images of mathematical writings into a fully fledged digital mathematics library, using the Czech Digital Mathematics Library (DML-CZ) project as the example. It gives an overview of the whole process and describes in detail the production steps that involve scanned-image processing and optical character recognition (OCR). It also reports the experience gained, the lessons learned, and the tools prepared during the development of DML-CZ. DML-CZ now serves more than 30,000 articles (more than 300,000 digitised pages) to the public.

Author information

Corresponding author

Correspondence to Petr Sojka.

Copyright information

© 2014 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Sojka, P. (2014). Digitization Workflow in the Czech Digital Mathematics Library. In: Feng, R., Lee, Ws., Sato, Y. (eds) Computer Mathematics. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-43799-5_13
