Abstract
We present a dataset of word embeddings for the Polish language. The embeddings can be used as input to Artificial Intelligence methods as an alternative to one-hot representations. Spatial relations between the embeddings reflect relations between words, such as alternatives and analogies, which improves the generalization of methods that use them. The embeddings were trained on data from Wikipedia with the skip-gram and continuous bag-of-words methods, originally introduced for the English language by Mikolov et al. The current version of the embeddings can be downloaded from http://publications.ics.p.lodz.pl/2016/word_embeddings/.
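To make the idea of analogies as spatial relations concrete, here is a minimal sketch in Python. The three-dimensional vectors are invented for illustration and are not taken from the published dataset; the names `EMBEDDINGS`, `cosine`, and `analogy` are hypothetical. Real embeddings would have many more dimensions and would be loaded from the downloaded files.

```python
import numpy as np

# Toy vectors, made up for this example (NOT values from the dataset).
EMBEDDINGS = {
    "król":      np.array([0.9, 0.8, 0.1]),   # king
    "królowa":   np.array([0.9, 0.1, 0.8]),   # queen
    "mężczyzna": np.array([0.1, 0.9, 0.1]),   # man
    "kobieta":   np.array([0.1, 0.1, 0.9]),   # woman
    "dziecko":   np.array([0.1, 0.5, 0.5]),   # child (distractor)
}


def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def analogy(a, b, c, embeddings=EMBEDDINGS):
    """Return the word whose vector is closest to a - b + c.

    The three query words are excluded from the candidates.
    """
    query = embeddings[a] - embeddings[b] + embeddings[c]
    candidates = {w: v for w, v in embeddings.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(query, candidates[w]))


# "król" - "mężczyzna" + "kobieta" lands nearest to "królowa" in this toy space.
result = analogy("król", "mężczyzna", "kobieta")
```

With one-hot vectors every pair of distinct words is equally far apart, so no such arithmetic is possible; dense vectors are what make the offset between related words meaningful.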
Notes
1. Even though sparse representations are encoded in a compact form, all operations are performed as if they were still full-size vectors.
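One way to read this note is sketched below in Python, under assumptions: a one-hot vector can be stored compactly as a single index, yet a linear operation on it gives the same result as on the full-size vector. The matrix `W`, its shape, and the helper `one_hot` are hypothetical and not taken from the paper.

```python
import numpy as np

vocab_size, dim = 5, 3
rng = np.random.default_rng(0)
# Hypothetical weight matrix: one row per vocabulary word.
W = rng.normal(size=(vocab_size, dim))


def one_hot(index, size):
    """Build the full-size one-hot vector for a word index."""
    v = np.zeros(size)
    v[index] = 1.0
    return v


word_index = 2  # compact encoding: just the position of the single 1

# Full-size view: multiply the one-hot vector by the matrix.
via_matmul = one_hot(word_index, vocab_size) @ W
# Compact view: select the corresponding row directly.
via_lookup = W[word_index]
```

Both views yield the same vector; the compact form only avoids materializing the zeros.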
References
Bengio, Y., Ducharme, R., Vincent, P., Janvin, C.: A neural probabilistic language model. J. Mach. Learn. Res. 3, 1137–1155 (2003)
Chen, Y., Perozzi, B., Al-Rfou, R., Skiena, S.: The expressive power of word embeddings. arXiv preprint (2013). arXiv:1301.3226
Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P.: Natural language processing (almost) from scratch. J. Mach. Learn. Res. 12, 2493–2537 (2011)
Huang, E.H., Socher, R., Manning, C.D., Ng, A.Y.: Improving word representations via global context and multiple word prototypes. In: Annual Meeting of the Association for Computational Linguistics (ACL) (2012)
Manning, C.D., Schütze, H.: Foundations of Statistical Natural Language Processing. MIT Press, Cambridge (1999)
Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Burges, C., Bottou, L., Welling, M., Ghahramani, Z., Weinberger, K. (eds.) Advances in Neural Information Processing Systems 26, pp. 3111–3119. Curran Associates, Inc. (2013)
Mikolov, T., Yih, W.t., Zweig, G.: Linguistic regularities in continuous space word representations. In: HLT-NAACL, pp. 746–751 (2013)
Mnih, A., Hinton, G.: Three new graphical models for statistical language modelling. In: Proceedings of the 24th International Conference on Machine Learning, pp. 641–648. ACM (2007)
Przepiórkowski, A.: A comparison of two morphosyntactic tagsets of Polish. In: Representing Semantics in Digital Lexicography: Proceedings of MONDILEX Fourth Open Workshop, pp. 138–144. Warsaw (2009)
Turian, J., Ratinov, L., Bengio, Y.: Word representations: a simple and general method for semi-supervised learning. In: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pp. 384–394. Association for Computational Linguistics (2010)
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Rogalski, M., Szczepaniak, P.S. (2016). Word Embeddings for the Polish Language. In: Rutkowski, L., Korytkowski, M., Scherer, R., Tadeusiewicz, R., Zadeh, L., Zurada, J. (eds) Artificial Intelligence and Soft Computing. ICAISC 2016. Lecture Notes in Computer Science, vol 9692. Springer, Cham. https://doi.org/10.1007/978-3-319-39378-0_12
DOI: https://doi.org/10.1007/978-3-319-39378-0_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-39377-3
Online ISBN: 978-3-319-39378-0
eBook Packages: Computer Science, Computer Science (R0)