LinkedSaeima: A Linked Open Dataset of Latvia’s Parliamentary Debates
This paper describes the LinkedSaeima dataset that contains structured data about Latvia’s parliamentary debates from 1993 until 2017. This information is published at http://dati.saeima.korpuss.lv as Linked Open Data. It is a part of the Corpus of Saeima (the Parliament of Latvia) released as open data for multidisciplinary research. The data model of LinkedSaeima follows the data structure of the LinkedEP dataset with a few modifications. The dataset is augmented with links to the Wikidata knowledge base that provide additional information about the speakers and named entities mentioned in the corpus.
Keywords: Linked Open Data, Parliament debate corpus, Named entity linking, Open government data, RDF
To ensure transparency of political and legislative processes, parliament proceedings and debate transcripts are usually made public. Saeima – the Parliament of the Republic of Latvia – publishes plenary transcripts on its website as unstructured text. In 2016 we published these transcripts as a text corpus with speaker annotations and other metadata.
With the increasing availability of corpora in different languages, we realized that unannotated corpora are not enough to meet researchers’ needs, such as comparative research across multiple languages. The 2018 release of the Corpus of Saeima addresses this concern by adding several annotation layers, including named entity mentions, automated English translation and morphosyntactic information for linguistic analysis. This release is available in multiple commonly used formats: as a text corpus in NoSketch query software, as syntactically parsed data and as Linked Open Data.
This paper describes LinkedSaeima – a Linked Data representation of the Corpus of Saeima containing structured information about Saeima proceedings and the entities mentioned in the proceedings, represented using Wikidata identifiers. Linked Data allows us to represent structured information about parliamentary debates by describing the properties of the objects from the domain of parliamentary meetings and the relations between these objects.
2 Parliamentary Speech Corpus
The source of data for this corpus is the Saeima website that contains transcripts of all parliament sessions in text format. These transcripts are processed using a semi-automatic pipeline to identify the boundaries of speeches and the speakers.
The Corpus of Saeima contains information about debates from eight parliamentary terms (5th–12th Saeima) covering the years 1993–2017. The transcripts in this corpus contain 38 million tokens and 497 thousand utterances. The available metadata for each utterance includes the date and type of the parliamentary session and the speaker’s name and affiliation. A subset of speeches, starting from 2015, was translated from Latvian to English using a neural machine translation system. The unreviewed machine-generated translation is included in the corpus for quantitative analysis and to aid searchability and understanding for international researchers. However, the quality of the automated translation is not sufficient for qualitative analysis of the Saeima corpus.
The named entities mentioned in this corpus were automatically linked to Wikidata, which serves as the entity knowledge base. The named entity recognition system is based on a full-text search of Wikidata entity names. These aliases are extended with a heuristically generated list of alternative variants for organization and person names, and are inflected by a custom Latvian phrase inflection system built upon the Latvian morphosyntactic tagger. As the primary goal of named entity recognition was to provide a mapping to Wikidata, no technical means were applied to recognize entities without relevant Wikidata entries. However, to improve the coverage of entity linking, Wikidata entries for historical members of parliament and other officials were created (if they did not already exist) and populated with data from open access sources available from Saeima. To disambiguate entities with overlapping names, the most likely entity was chosen based on a cosine similarity metric with respect to structured Wikidata information extracts, adapting a system developed earlier for news corpora analysis.
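The disambiguation step can be illustrated with a minimal Python sketch. It assumes a plain bag-of-words representation and whitespace tokenisation, which may differ from the features used in the actual system; the entity identifiers and description strings are hypothetical placeholders rather than real Wikidata records.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercase whitespace tokenisation into a term-frequency vector."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-frequency vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def disambiguate(context, candidates):
    """Return the candidate ID whose description best matches the context.

    `candidates` maps a (hypothetical) entity ID to a text extract
    built from its structured knowledge-base record.
    """
    context_vec = bag_of_words(context)
    return max(
        candidates,
        key=lambda entity_id: cosine_similarity(
            context_vec, bag_of_words(candidates[entity_id])
        ),
    )


# Hypothetical candidates that share the same surface name.
candidates = {
    "Q_EXAMPLE_1": "politician member of parliament minister latvia",
    "Q_EXAMPLE_2": "basketball player sports club riga",
}
context = "the minister addressed the parliament about the budget"
best = disambiguate(context, candidates)
```

Here the speech context shares the terms "minister" and "parliament" only with the first candidate, so that candidate is selected. A production system would likely weight terms (for example with TF-IDF) and handle Latvian inflection before comparison.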
3 LinkedSaeima Dataset
This paper focuses on LinkedSaeima – the Linked Data representation of the Saeima speech corpus. The current version of the dataset, published in May 2019, consists of approximately 4.9 million RDF triples. Since the original January 2018 release we have fixed the identified issues with its RDF representation and improved the usability of the human-readable view of the dataset.
The data model uses the following main classes:
Meeting (lpv_eu:SessionDay) – a top-level concept representing one parliament plenary meeting usually consisting of multiple Speeches;
Speech (lpv_eu:Speech) – an individual speech (utterance) given at a Meeting by a single Speaker in a particular Role;
Speaker (lpv:Speaker) – a person giving a speech;
Role (lpv:PoliticalFunction) – a role which the person represented when giving a Speech (e.g. the Prime Minister). A person may appear in multiple roles.
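To make the relations between these four classes concrete, the following Python sketch builds the triples for a single speech and serialises them as N-Triples. Only the class names come from the list above. The namespace URIs for the `lpv` and `lpv_eu` prefixes are assumptions, and the resource URIs and linking properties under `http://example.org/` are hypothetical placeholders, not the identifiers used in the published dataset.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
# Assumed namespace URIs for the lpv and lpv_eu prefixes.
LPV = "http://purl.org/linkedpolitics/vocabulary/"
LPV_EU = "http://purl.org/linkedpolitics/vocabulary/eu/plenary/"
# Hypothetical placeholder namespace for resources and linking properties.
EX = "http://example.org/saeima/"


def uri(value):
    """Wrap a URI in angle brackets as required by N-Triples."""
    return f"<{value}>"


def literal(value):
    """Serialise a plain string literal with basic N-Triples escaping."""
    escaped = (
        value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    )
    return f'"{escaped}"'


def describe_speech(meeting_id, speech_id, speaker_id, role_id, text):
    """Return triples typing each resource and linking them to the speech."""
    meeting = uri(EX + "meeting/" + meeting_id)
    speech = uri(EX + "speech/" + speech_id)
    speaker = uri(EX + "speaker/" + speaker_id)
    role = uri(EX + "role/" + role_id)
    return [
        (meeting, uri(RDF_TYPE), uri(LPV_EU + "SessionDay")),
        (speech, uri(RDF_TYPE), uri(LPV_EU + "Speech")),
        (speaker, uri(RDF_TYPE), uri(LPV + "Speaker")),
        (role, uri(RDF_TYPE), uri(LPV + "PoliticalFunction")),
        # Hypothetical linking properties, for illustration only.
        (speech, uri(EX + "partOf"), meeting),
        (speech, uri(EX + "speaker"), speaker),
        (speech, uri(EX + "inRole"), role),
        (speech, uri(EX + "text"), literal(text)),
    ]


def to_ntriples(triples):
    """Join (subject, predicate, object) tuples into N-Triples lines."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


example = describe_speech("m1", "s1", "p1", "r1", "Example speech text.")
print(to_ntriples(example))
```

Keeping Role separate from Speaker means the same person can be linked to different functions in different speeches, which matches the note that a person may appear in multiple roles.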
There is ongoing work for standardization of corpora of parliamentary proceedings based on TEI. Our approach could be applied to other parliamentary speech corpora by implementing a transformation from the TEI standard once it is finalized in order to make these resources available as Linked Data.
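As a rough illustration of the first step of such a transformation, the sketch below extracts speaker and text records from TEI utterance elements using the Python standard library. It assumes speeches are encoded as TEI `<u>` elements with a `who` attribute, as in general TEI practice for transcribed speech; the sample document is invented, and the final parliamentary schema may structure utterances differently.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}
# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Invented sample document, for illustration only.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><div>
    <u who="#speaker1" xml:id="u1">First utterance.</u>
    <u who="#speaker2" xml:id="u2">Second
       utterance.</u>
  </div></body></text>
</TEI>"""


def extract_utterances(tei_xml):
    """Return one record per <u> element with its ID, speaker and text."""
    root = ET.fromstring(tei_xml)
    records = []
    for u in root.iterfind(".//tei:u", TEI_NS):
        records.append(
            {
                "id": u.get(XML_ID),
                # The who attribute is a local pointer such as "#speaker1".
                "speaker": (u.get("who") or "").lstrip("#"),
                # Collapse whitespace across any nested elements.
                "text": " ".join("".join(u.itertext()).split()),
            }
        )
    return records


records = extract_utterances(SAMPLE)
```

Each record could then be mapped to a Speech resource and linked to its Speaker, with meeting-level metadata taken from the TEI header.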
4 Data Access and Implementation
URI patterns used in the LinkedSaeima dataset.
In this paper we described LinkedSaeima – a Linked Data representation of the dataset of Latvia’s parliamentary debates extended with NLP annotation layers. We hope that its Linked Data representation and the new annotation levels (entity references and translation) will allow researchers from other countries to use this resource in their studies and to compare Latvia’s parliamentary data with data from other national parliaments, and that it will give users new ways of exploring this information.
Expected future work includes extending the LinkedSaeima dataset with additional types of structured information, for example, voting data, and adding automated translations for the whole historical dataset. Improvements to entity recognition and morphosyntactic tagging are being carried out as part of related research projects.
By publishing this parliamentary corpus as Linked Open Data and by including links to Wikidata entities we hope to facilitate the development of a global network of linked political and legal information, and to provide an example to other implementers.
This research has been partially supported by the University of Latvia project AAP2016/B032 “Innovative information technologies”, the European Regional Development Fund under the grant agreement No. 220.127.116.11/16/A/219 and the research project “Competence Centre of Information and Communication Technologies” of EU Structural funds, IT Competence Centre contract No. 18.104.22.168/18/A/003 research project No. 2.4 “Platform for the semantically structured information extraction from the massive Latvian news archive”.
- 1. Darģis, R., Rābante-Buša, G., Auziņa, I., Kruks, S.: ParliSearch - A system for large text corpus discourse analysis. Frontiers in Artificial Intelligence and Applications, vol. 289, pp. 115–121 (2016)
- 2. Darģis, R., Auziņa, I., Bojārs, U., Paikens, P., Znotiņš, A.: Annotation of the corpus of the Saeima with multilingual standards. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018) (2018)
- 3. Heath, T., Bizer, C.: Linked Data: Evolving the Web into a Global Data Space. Synthesis Lectures on the Semantic Web, 1st edn., vol. 1, no. 1, pp. 1–136. Morgan & Claypool (2011)
- 5. Barone, A.V.M., Helcl, J., Sennrich, R., Haddow, B., Birch, A.: Deep architectures for neural machine translation. In: Proceedings of the Second Conference on Machine Translation, Vol. 1: Research Papers, pp. 99–107. Association for Computational Linguistics (2017)
- 6. Paikens, P.: Deep neural learning approaches for Latvian morphological tagging. In: Proceedings of Human Language Technologies - The Baltic Perspective, pp. 119–125 (2016)
- 7. Paikens, P.: Latvian newswire information extraction system and entity knowledge base. In: Proceedings of Human Language Technologies - The Baltic Perspective, pp. 119–125 (2014)
- 9. Erjavec, T., Pančur, A.: Parla-CLARIN: a TEI schema for corpora of parliamentary proceedings (2019). https://clarin-eric.github.io/parla-clarin/
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.