Parsing of Polish in Graph Database Environment

Posiadała, Jan; Czaja, Hubert; Szczechla, Eliza; Susicki, Paweł

doi:10.1007/978-3-319-93782-3_7

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10930)

Included in the following conference series:

Language and Technology Conference

Abstract

This paper describes the basic concepts and features of the Langusta system. Langusta is a natural language processing environment embedded in a graph database. The paper presents a rule-based syntactic parsing system for the Polish language that uses various linguistic resources, including those containing semantic information. The advantages of this approach follow directly from the graph paradigm, in particular from the assumption that rules describing the syntax of the Polish language are valid queries in a graph database query language (Cypher).

Notes

1.
http://neo4j.com/.
2.
http://orientdb.com/orientdb/.
3.
The Apache TinkerPop project is best known for providing a set of interfaces (Blueprints) that graph database vendors can implement to get all the features of the rest of the TinkerPop stack (Pipes, Gremlin, Frames, Rexster, Furnace), where each part of the stack provides a specific function in supporting graph-based application development; http://tinkerpop.apache.org/.
4.
Java types description: https://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html.
5.
Translation to English: “Young girls run.”
6.
The list intersection operator *= is not supported by the implementation of Cypher in the Neo4j database. The interpretation is: the expression is false if and only if the resulting list is empty.
7.
The correspondence between the WHERE expression in a Langusta rule and the unify operator in a SPEJD rule is limited to the condition component of the unify operator. Applying a Langusta rule rejects no interpretation.
8.
The correspondence between the semantics of the group action in a SPEJD rule and the consequence of applying a Langusta rule seems to be very strong, obviously excluding the capability to represent ambiguity.
9.
Langusta supports the handling of word order inversion, which is common in Polish, a synthetic language. With this mechanism, the number of rules needed to parse corresponding expressions in normal and inverted order is not doubled. The mechanism is limited to rules that match two Word nodes. This means that in the Langusta system the expression “dziewczyny młode” will be parsed by the same rule as the corresponding normal-order expression (although certainly not by the same query). To apply a given rule to the inverted word order, it suffices to pass the appropriate InversionRate value in the environment, i.e. the weight for the rule that tries to match the nodes in inverted order.
10.
The phrases “bottle of gasoline” and “sacks for leaves” are instances of the prepositional phrase pattern “container of/for something”. “Bottle” and “sack” are hyponyms of “container” and inherit its valency features.
11.
When the MATCH clause contains more than one path, Langusta selects the first one as the matching path by default. The unnamed and undirected relationships between the nodes on this path are labelled :follows and directed from left to right.
12.
To make the plWordNet dictionary easier to use, the rules work with the transitive closure of the WordNet graph over the hyponymy relation edges, taking into account transitions through synset groups, i.e. if a lexical unit lu1 is a hyponym of a lexical unit lu2, then all lexical units sharing a synset group with lu1 are hyponyms of all lexical units sharing a synset group with lu2.
13.
Poliqarp, similarly to SPEJD, based its syntax on the CQP formalism derived from the CWB project (The IMS Open Corpus Workbench, http://cwb.sourceforge.net/).
14.
Poliqarp, similarly to SPEJD, was used as part of the NKJP project.
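
The closure described in note 12 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Langusta's implementation: the function name `hypernym_closure`, the dictionary layout, and the toy English data are all hypothetical, and the synset groupings are invented for the example rather than taken from plWordNet. Unit-level hyponymy edges are first lifted to edges between synset groups, then followed transitively.

```python
from collections import defaultdict


def hypernym_closure(synset_of, lu_hyponymy):
    """Map each lexical unit to every unit it is a direct or indirect hyponym of.

    synset_of   : dict, lexical unit -> synset group id (hypothetical layout)
    lu_hyponymy : iterable of (hyponym_unit, hypernym_unit) pairs
    """
    # Group lexical units by synset group.
    members = defaultdict(set)
    for lu, syn in synset_of.items():
        members[syn].add(lu)

    # Lift unit-level hyponymy edges to edges between synset groups.
    parents = defaultdict(set)
    for hypo, hyper in lu_hyponymy:
        parents[synset_of[hypo]].add(synset_of[hyper])

    def ancestors(syn):
        # Iterative depth-first search over the synset-level edges.
        seen, stack = set(), [syn]
        while stack:
            for parent in parents.get(stack.pop(), ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

    # Every unit inherits all members of every ancestor synset group.
    closure = {}
    for lu, syn in synset_of.items():
        closure[lu] = set()
        for anc in ancestors(syn):
            closure[lu] |= members[anc]
    return closure


# Toy data, invented for illustration only.
SYNSET_OF = {
    "bottle": "s_bottle", "flask": "s_bottle",
    "container": "s_container", "receptacle": "s_container",
    "object": "s_object",
}
LU_HYPONYMY = [("bottle", "container"), ("receptacle", "object")]

if __name__ == "__main__":
    closure = hypernym_closure(SYNSET_OF, LU_HYPONYMY)
    print(sorted(closure["flask"]))
```

In this toy data no edge mentions “flask”, yet it ends up a hyponym of “container”, “receptacle” and “object”, because it shares a synset group with “bottle” and the closure passes through the group of “container”.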

Author information

Authors and Affiliations

Scott Tiger S.A., 15 Kolektorska Street, Warsaw, Poland
Jan Posiadała, Hubert Czaja, Eliza Szczechla & Paweł Susicki

Corresponding author

Correspondence to Hubert Czaja.

Editor information

Editors and Affiliations

Adam Mickiewicz University, Poznań, Poland
Zygmunt Vetulani
LIMSI-CNRS, Orsay Cedex, France
Joseph Mariani
Adam Mickiewicz University, Poznań, Poland
Marek Kubis

About this paper

Cite this paper

Posiadała, J., Czaja, H., Szczechla, E., Susicki, P. (2018). Parsing of Polish in Graph Database Environment. In: Vetulani, Z., Mariani, J., Kubis, M. (eds) Human Language Technology. Challenges for Computer Science and Linguistics. LTC 2015. Lecture Notes in Computer Science, vol 10930. Springer, Cham. https://doi.org/10.1007/978-3-319-93782-3_7

DOI: https://doi.org/10.1007/978-3-319-93782-3_7
Published: 16 June 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-93781-6
Online ISBN: 978-3-319-93782-3
eBook Packages: Computer Science, Computer Science (R0)
