Abstract
This paper describes the experience of setting up a workflow that turns scanned images of mathematical writings into a fully fledged digital mathematics library, using the Czech Digital Mathematics Library (DML-CZ) project as an example. It gives an overview of the whole process and describes in detail the production steps involving scanned image processing and optical character recognition. It also presents the experience gained, lessons learned, and tools prepared during the development of DML-CZ. DML-CZ now serves more than 30,000 articles (more than 300,000 digitised pages) to the public.
Copyright information
© 2014 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Sojka, P. (2014). Digitization Workflow in the Czech Digital Mathematics Library. In: Feng, R., Lee, Ws., Sato, Y. (eds) Computer Mathematics. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-43799-5_13
DOI: https://doi.org/10.1007/978-3-662-43799-5_13
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-43798-8
Online ISBN: 978-3-662-43799-5
eBook Packages: Mathematics and Statistics (R0)