Connecting Data for Digital Libraries: The Library, the Dictionary and the Corpus

  • Conference paper
Digital Libraries at the Crossroads of Digital Information for the Future (ICADL 2019)

Abstract

The paper presents two experiments in enhancing the content of a digital library with data from external repositories. The concept involves three related resources: a digital library of Middle Polish prints where items are stored in image form, the same items in textual form in a linguistically annotated corpus, and a dictionary of Middle Polish. The first experiment demonstrates how the results of automated OCR obtained with open source tools can be replaced with transcribed content from the corpus, enabling the user to search within individual prints. The second experiment links the print content with the electronic dictionary, filtering relevant entries with a dictionary of modern Polish to eliminate redundant results. Interconnecting all relevant resources in a platform centered on the digital library creates new possibilities both for researchers developing these resources and for scholars studying the Polish language of the 17th and 18th centuries.
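The filtering idea behind the second experiment can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that "redundant results" are words that also appear in a modern Polish lexicon, that the print content is already available as a list of lemmas, and that the historical dictionary can be represented as a mapping from lemma to entry URL. The function name `link_to_dictionary`, the sample words and the `example.org` URLs are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple


def link_to_dictionary(
    lemmas: Iterable[str],
    historical_entries: Dict[str, str],
    modern_lexicon: Set[str],
) -> List[Tuple[str, str]]:
    """Return (lemma, entry URL) pairs for lemmas worth linking.

    Assumption: a lemma is worth linking when the historical dictionary
    has an entry for it and the modern lexicon does not list it.
    """
    links: List[Tuple[str, str]] = []
    seen: Set[str] = set()
    for lemma in lemmas:
        key = lemma.lower()
        if key in seen:
            continue  # link each lemma only once per print
        seen.add(key)
        if key in modern_lexicon:
            continue  # known to modern Polish, treated as redundant
        url = historical_entries.get(key)
        if url is not None:
            links.append((lemma, url))
    return links


# Hypothetical sample data: placeholder URLs, not real dictionary entries.
historical_entries = {
    "dom": "https://example.org/entry/dom",
    "białogłowa": "https://example.org/entry/bialoglowa",
}
modern_lexicon = {"dom"}

links = link_to_dictionary(
    ["dom", "białogłowa", "Białogłowa"], historical_entries, modern_lexicon
)
```

In this sketch only the lemma missing from the modern lexicon is linked, and repeated occurrences are linked once. A real pipeline would also need lemmatization of historical spelling variants, which is outside the scope of this example.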

The work was financed by a research grant from the Polish Ministry of Science and Higher Education under the National Programme for the Development of Humanities for the years 2019–2023 (grant 11H 18 0413 86, grant funds received: 1,797,741 PLN).

Notes

  1. https://cbdu.ijp.pan.pl/.

  2. Content objects are currently available for over 1 400 prints and amount to 11 585 pages.

  3. See print 1264 at http://cbdu.ijp.pan.pl/12640/. The print also features a historical commentary and a manually created glossary explaining more difficult terms.

  4. http://zil.ipipan.waw.pl/PoliMorf, version 0.6.7.

  5. http://sgjp.pl/.

  6. http://morfologik.blogspot.com/.

  7. https://tshwanedje.com/tshwanelex/.

  8. E.g. http://www.macmillandictionaries.com/features/from-corpus-to-dictionary/.

  9. https://github.com/tesseract-ocr/.

  10. https://www.abbyy.com/finereader/.

  11. http://www.impact-project.eu/.

  12. For two of the 40 transcribed prints, the OCR process failed with a segmentation fault.

  13. https://github.com/jbarlow83/OCRmyPDF.

  14. Note: all open source tools used in the process should be treated only as examples; similar results can be achieved with different software. For the OCR PDF conversion, the authors of OCRmyPDF suggest several other open source programs for comparison, such as pdf2pdfocr or pdfsandwich; the commercial ABBYY FineReader suite is also frequently used for similar purposes.

  15. https://www.scootersoftware.com/.

  16. https://metacpan.org/pod/CAM::PDF.

  17. Page 3 of print 1441: Relacja koronacji cudownego obrazu Najświętszej Marii Panny na Górze Różańcowej [w Podkamieniu] (Report on the coronation of the miraculous image of the Virgin Mary on the Rosary Hill [in Podkamień]; https://cbdu.ijp.pan.pl/14410/).

  18. The most recent report available to the author estimates the market share of Adobe Reader 10 at 25% and of Adobe Reader 11 at 55%; see https://www.flexera.com/about-us/press-center/secunia-quarterly-country-report-pdf-readers-left-to-attacks-on-private-us-pcs.html.

  19. Version DC 19.012.20035.

  20. https://itextpdf.com/en/products/itext-7/itext-7-community, version 7.1.0. Again, many commercial and non-commercial replacements exist for the software used, such as the Evermap AutoBookmark Plug-in (https://www.evermap.com/autobookmark.asp) or the Qoppa jPDFProcess Library (https://www.qoppa.com/pdfprocess/).

Acknowledgements

The authors would like to thank Grzegorz Kulesza for his diligent proofreading of this paper.

Author information

Corresponding author

Correspondence to Maciej Ogrodniczuk.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Ogrodniczuk, M., Gruszczyński, W. (2019). Connecting Data for Digital Libraries: The Library, the Dictionary and the Corpus. In: Jatowt, A., Maeda, A., Syn, S. (eds) Digital Libraries at the Crossroads of Digital Information for the Future. ICADL 2019. Lecture Notes in Computer Science, vol 11853. Springer, Cham. https://doi.org/10.1007/978-3-030-34058-2_13

  • DOI: https://doi.org/10.1007/978-3-030-34058-2_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-34057-5

  • Online ISBN: 978-3-030-34058-2

  • eBook Packages: Computer Science, Computer Science (R0)