Mining Governmental Collaboration Through Semantic Profiling of Open Data Catalogues and Publishers

Adel Rezk, Mohamed; Ojo, Adegboyega; Hassan, Islam A.

doi:10.1007/978-3-319-65151-4_24

Part of the book series: IFIP Advances in Information and Communication Technology (IFIPAICT, volume 506)

Included in the following conference series:

Working Conference on Virtual Enterprises

Abstract

Due to the increasing adoption of open data by governments worldwide, especially in the European Union, deeper analysis of newly published data is becoming necessary. Apart from analyzing the published datasets themselves, we aim to analyze the published dataset catalogues. A dataset catalogue, or dataset metadata, contains features that describe in textual form what the data is about. We first acquire data from open data portals, choose descriptive dataset catalogue features, and then construct an aggregated textual representation of the datasets. Afterwards, we enrich those textual representations using Natural Language Processing (NLP) methods to create a new comparable data feature, “Named Entities”. By mining this new feature we are able to produce relatedness networks of datasets and publishers. These networks are used to point out similarities between the data published across multiple open data portals. Identifying all possible collaborations for integrating and standardizing data features and types would increase the value of the data and ease its analysis.

1 Introduction

Despite the availability of data loaded into open data portals worldwide^{Footnote 1} [1, 2], methods to maximize stakeholders’ engagement and ease data integration are still incomplete [3,4,5]. We believe that proper mining of collaboration channels, both within a single data portal and between multiple open data portals, has not yet been introduced. Our work aims to develop a framework for mining collaboration channels in open data portals, as shown in Fig. 1. To achieve this, we start with data acquisition by harvesting the metadata of the datasets published on the portal, then restructure and store it in MongoDB^{Footnote 2}. Afterwards, we construct a textual representation from the unstructured features of the dataset metadata and apply DBpedia Spotlight [7], the DBpedia [6] Named Entity Recognition pipeline, to extract information that represents those datasets and their publishers. We end up with a semantically enriched dataset upon which we can apply our profiling [5] and collaboration opportunities analysis. The paper is organized as follows: Sect. 2 presents background on Open Government Data, NLP and Collaboration Mining. Section 3 discusses our approach to the research question and its results. Section 4 presents applications, and Sect. 5 presents our conclusions and future plans.

2 Background and Related Work

This section defines the key concepts and reviews the literature in the related research areas: Open Government Data, NLP and Collaboration Mining.

2.1 Open Government Data

Open Government Data (OGD) refers to datasets generated and published by governmental departments “without any restrictions on its usage or distribution” that do not contain any personal or undisclosed data [8]. OGD varies in multiple aspects, for example: (a) the domain of the publishing department or agency, e.g. Agriculture Data, Transport Data, Environmental Data, Financial Data and Telecommunication Data; (b) the data format, e.g. Excel, Text, PDF, CSV. Theoretically, Open Government Data is operational or administrative governmental data available to use, redistribute, and analyze “in any form without any copyright restrictions” [9]. In its 2007 draft^{Footnote 3}, the open government working group generated initial open data principles: data must be complete, primary, timely, accessible, machine-processable, nondiscriminatory, nonproprietary, and license-free. It then generated further open data principles: data must be online and free, permanent, trusted, assumed to be open, documented, safe to open, and designed with public input. Figure 2 shows the Irish government’s open data portal, which we used for our experiments^{Footnote 4}.

2.2 Natural Language Processing

Below we discuss the aspects of Natural Language Processing that are relevant to our research, specifically Named Entity Recognition applications.

2.2.1 Named Entity Recognition

Named Entity Recognition (NER) is the process of discovering Named Entities (NE) lying within a given text. A common definition of an NE is as follows [10]: “an information unit described by the name of a person or an organization, a location, a brand, a product, a numeric expression including time, date, money and percent found in a sentence.” [11]. NER applications are implemented using multiple methodologies:

Supervised Learning techniques use a large, manually categorized dataset to train the recognition algorithm. They apply Conditional Random Fields [12], Hidden Markov Models [13], Decision Trees [14], Support Vector Machines [15] and Maximum Entropy Models [16]. The objective of these methods is to identify and categorize related keywords. The unavailability of manually categorized datasets and the high cost of generating them represent a challenging obstacle for Supervised Learning techniques.

Semi-Supervised Learning and Unsupervised Learning techniques use either a small categorized dataset for training the algorithm [17] or a clustering-based algorithm. Further Unsupervised Learning techniques depend on lexical resources, e.g. WordNet, and on statistics to solve the NER task as a prediction problem [18].

2.3 Natural Language Processing in E-Government

There are few implementations of NLP technologies in the e-government area. Examples from the literature include: a proposed application that gathers crime data from police departments and eyewitness stories and applies NLP technologies with GATE [19]; a system that automates the email answering process, fully or semi-automatically, using NLP technologies [20]; and an application that presents an original model for incorporating multimedia data to assist e-government tasks [21].

2.4 Mining for Collaboration

The detection of possible collaboration opportunities within an organization, or across multiple organizations and platforms, is targeted in multiple domains because of its benefits, e.g. faster processes, standardization and integration. The few existing works on mining for collaboration include the following. In the library domain, research has detected possible collaboration opportunities with academic professionals based on their publications, in order to increase the benefit to students [22]. Collaboration mining between governmental levels and departments, based on their objectives, resources and services, aims to increase government efficiency in public policy development and implementation, crisis management, etc. [23]. A collaboration mining tool uses agent technology to analyze the collaboration between information on the web, helping its users get their desired materials more accurately and faster [24]. Collaboration mining of team members uses summaries of successful past projects to increase moderator efficiency and to promote project partners’ awareness of the best way to formulate a proposal for a European research project [25].

3 Semantic Profiling for Collaboration Mining

As shown in Fig. 1 and in more detail in Fig. 3, we have designed a solution pipeline that incorporates Data Acquisition, Data Modeling, Data Analysis, and Data Visualization technologies to enable a collaboration mining tool. We start by specifying the targeted open data portal(s) in which we want to mine for collaborations, and then acquire the metadata (catalogue) of the datasets.

We then restructure the catalogue to fit the predesigned storage model (semantic profile). Within this model we enhance, filter and exclude catalogue features that are less important for our use case, and we add new features that correspond to our collaboration mining requirements. For example, we add a “textual representation” feature by merging the original textual features of the data catalogue; we add an “Entities” feature by applying NER over the new “textual representation” feature; we filter features like “author” and “creator” to end up with only a “publisher ID” feature; and we exclude the “groups” and “tracking summary” features.

After constructing and storing the new data model (semantic profile), we start the unstructured data analysis (text mining) pipeline by applying the NER algorithm. At the end of that process we generate the comparable feature “Entities” and add it to the new data model to be used for collaboration mining. After that we construct the data model of each dataset publisher (its semantic profile), which contains aggregated feature values from the publisher’s datasets. Finally, we compute relation strengths between dataset publishers by comparing their semantic profiles, which are built from the aggregation of the unique entities they publish datasets about, and store the results as shown in Fig. 4 for later visualization and web service usage, as shown in Figs. 10 and 11.
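The restructuring step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the raw field names (`title`, `notes`, `tags`, `organization`), the output keys, and the function `build_semantic_profile` are hypothetical, and the NER step is passed in as a plain callable that stands in for a service such as DBpedia Spotlight.

```python
from typing import Any, Callable, Dict, List

# Assumed names of the textual catalogue features to merge (hypothetical).
TEXT_FEATURES = ("title", "notes", "tags")


def build_semantic_profile(
    record: Dict[str, Any],
    extract_entities: Callable[[str], List[str]],
) -> Dict[str, Any]:
    """Restructure one raw catalogue record into a semantic profile."""
    # Merge the textual features into one "textual representation".
    parts: List[str] = []
    for name in TEXT_FEATURES:
        value = record.get(name)
        if isinstance(value, (list, tuple)):
            parts.extend(str(v) for v in value)
        elif value:
            parts.append(str(value))
    text = " ".join(parts)

    # Only the selected features are copied over, so "author", "groups"
    # and "tracking_summary" are dropped simply by not being included.
    return {
        "dataset_id": record.get("id"),
        "publisher_id": record.get("organization"),
        "textual_representation": text,
        "entities": sorted(set(extract_entities(text))),
    }


def toy_extractor(text: str) -> List[str]:
    """Stand-in for a real NER service: matches a tiny fixed vocabulary."""
    vocabulary = {"pollution", "hydrography"}
    return [word for word in text.lower().split() if word in vocabulary]


raw = {
    "id": "ds-1",
    "title": "Coastal pollution survey",
    "notes": "Hydrography and pollution measurements",
    "tags": ["marine"],
    "author": "someone",
    "groups": ["environment"],
    "tracking_summary": {"total": 3},
    "organization": "marine-institute",
}
profile = build_semantic_profile(raw, toy_extractor)
```

Passing the extractor in as a parameter keeps the restructuring logic independent of the NER tool, which matters here because several NER tools were tried before one was chosen (see Sect. 3.4).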

Below we discuss and present the results of our Semantic Profiling for Collaboration Mining approach.

3.1 Profiling the Catalogues

By querying the stored, enriched metadata of an open data portal we are able to generate charts that profile the underlying open data catalogue. For example, we can retrieve the named entities detected by mining the unstructured textual representations of the data catalogues generated by our tool. These named entities, which are derived from the dataset metadata, are able, like their source, to describe the contents of the data portals; see Figs. 5 and 6.
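One such profiling query is a count of how many datasets mention each named entity. The sketch below works on in-memory records rather than on the MongoDB store, and the record layout (an `entities` list per dataset) and the function name `entity_frequencies` are assumptions made for illustration.

```python
from collections import Counter


def entity_frequencies(profiles):
    """Count in how many dataset profiles each named entity occurs."""
    counts = Counter()
    for profile in profiles:
        # Use a set so an entity is counted once per dataset.
        counts.update(set(profile.get("entities", [])))
    return counts


# Toy profiles, invented for the example.
profiles = [
    {"dataset_id": "a", "entities": ["pollution", "hydrography"]},
    {"dataset_id": "b", "entities": ["pollution"]},
    {"dataset_id": "c", "entities": ["transport", "pollution"]},
]
counts = entity_frequencies(profiles)
most_frequent = counts.most_common(1)
```

The resulting counts are the kind of data a bar chart or word cloud of a catalogue's most frequent entities would be drawn from.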

3.2 Publishers Profiles

Open data publishers are an interesting feature for open data analysis. Publishers can be governmental departments, councils, etc., which makes their profiles a key component of governmental data integration and standardization. An open data publisher’s profile is the aggregation of the information extracted from the metadata of its published datasets. One use of a publisher’s profile is to understand more about the publisher’s domain; see Fig. 7 for an example.
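The aggregation of dataset-level information into publisher profiles can be sketched as follows. The record keys (`publisher_id`, `entities`) and the profile layout (`dataset_count`, `entity_counts`) are hypothetical choices for this example, not the storage model used in the paper.

```python
from collections import Counter, defaultdict


def build_publisher_profiles(dataset_profiles):
    """Aggregate dataset-level entities into one profile per publisher."""
    profiles = defaultdict(
        lambda: {"dataset_count": 0, "entity_counts": Counter()}
    )
    for dataset in dataset_profiles:
        publisher = profiles[dataset["publisher_id"]]
        publisher["dataset_count"] += 1
        # entity_counts[e] = number of this publisher's datasets mentioning e.
        publisher["entity_counts"].update(set(dataset["entities"]))
    return dict(profiles)


# Toy dataset profiles, invented for the example.
datasets = [
    {"publisher_id": "marine-institute", "entities": ["pollution", "hydrography"]},
    {"publisher_id": "marine-institute", "entities": ["pollution", "fisheries"]},
    {"publisher_id": "geological-survey-of-ireland", "entities": ["hydrography"]},
]
publisher_profiles = build_publisher_profiles(datasets)
```

The keys of `entity_counts` are the publisher's unique entities, and the values record how many of its datasets mention each one, which is enough to describe the publisher's domain.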

3.3 Interlinking Publishers

The resulting publisher profiles are used to mine possible collaboration channels between data publishers, both within a data portal and across portals, using the added comparable feature “Entities”; see Figs. 8, 9 and 10.

According to our results, “marine-institute” (129 datasets) and “geological-survey-of-ireland” (67 datasets) have the highest relation strength score, 82, which means that they share 82 entities/topics. We examined the datasets published by both publishers and found that, for the pollution concept/topic, there are 7 datasets published by “marine-institute” and 7 datasets published by “geological-survey-of-ireland”. Similarly, for the hydrography concept/topic, there are 4 datasets published by “marine-institute” and 18 datasets published by “geological-survey-of-ireland”, as shown in Figs. 11 and 12.
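Since the relation strength is described as the number of entities two publishers share, it can be sketched as the size of the intersection of their entity sets, with entities matched by exact string equality. The function name `relation_strengths` and the toy entity sets are assumptions for illustration; they are not the data behind the score of 82 reported above.

```python
from itertools import combinations


def relation_strengths(publisher_entities):
    """Map each unordered publisher pair to its number of shared entities."""
    strengths = {}
    for a, b in combinations(sorted(publisher_entities), 2):
        shared = publisher_entities[a] & publisher_entities[b]
        if shared:  # keep only pairs that share at least one entity
            strengths[(a, b)] = len(shared)
    return strengths


# Toy entity sets, invented for the example; "transport-agency" is hypothetical.
publisher_entities = {
    "marine-institute": {"pollution", "hydrography", "fisheries"},
    "geological-survey-of-ireland": {"pollution", "hydrography", "geology"},
    "transport-agency": {"roads"},
}
strengths = relation_strengths(publisher_entities)
```

Each entry of the result can be read as a weighted edge in a publisher relatedness network, where the weight is the shared-entity count.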

3.4 Limitations

The Named Entity Recognition part of this work is tightly coupled with the training and quality of the Named Entity Recognition algorithm. During this research we experimented with the Natural Language Toolkit (NLTK), Stanford NER, and Stanford NER enhanced with n-grams of size 3. We ended up using DBpedia Spotlight as the NE source because, in our manual examination of the text analysis results, it outperformed the other methods in NE detection quality. DBpedia Spotlight still has its limitations, though, and we reported one of the issues we faced to its GitHub repository^{Footnote 5}.

4 Applications

4.1 Standardization and Collaboration Analysis

Although most governments already publish their data via open data portals, when a government decides to integrate its data sources across its various departments and councils, this heterogeneous, domain-dependent data will consume huge analysis resources and a considerable period of time before it fits into an integrated data repository. Our profiling service will help data analysts define integration channels and the necessary concept standardizations between governmental departments and councils, using the data available on open data portals. The same approach would fit a multinational enterprise as well.

For example, “marine-institute” and “geological-survey-of-ireland” share the named entity “pollution”. This concept should be standardized with regard to its code and its measurement unit to ease integration, comparability and analysis in general across multiple datasets.

4.2 Intelligent Open Data Portals Exploration

Open data portals are meant to face the public, in other words the citizens, but citizens cannot directly comprehend and consume this raw data [4]. An open data portal profiling service will help citizens explore the portal easily and intelligently using visualized semantic profiles of publishers and datasets.

5 Conclusion and Future Work

Based on the results of our approach, we believe that we are on the right track to tackle the collaboration mining problem in the open governmental data domain: our pipeline produces interesting collaboration recommendations in a visualized form that is easy for general public users of open governmental data to comprehend.

Our future plan is to overcome the NE limitation by developing a new text analysis pipeline that integrates statistical text analysis, babel.net^{Footnote 6}, and DBpedia^{Footnote 7} as our NE source. We also plan to replace the string comparison module with a semantic relatedness comparison module as the way of calculating relation strength between open governmental data publishers.

References

1. Shadbolt, N., O’Hara, K., Berners-Lee, T., Gibbins, N., Glaser, H., Hall, W., Schraefel, M.C.: Linked open government data: lessons from data.gov.uk. IEEE Intell. Syst. 27, 16–24 (2012)
2. Breitman, K., Salas, P., Casanova, M.A., Saraiva, D.: Open government data in Brazil. IEEE Intell. Syst. 27, 45–49 (2012)
3. Mutuku, L.N., Colaco, J.: Increasing Kenyan open data consumption. In: Proceedings of 6th International Conference on Theory and Practice of Electronic Governance, ICEGOV 2012, p. 18 (2012)
4. Artigas, F., Chun, S.A.: Visual analytics for open government data. In: 14th Annual International Conference on Digital Government Research, From E-Government to Smart Gov.dg.o 2013, pp. 298–299 (2013)
5. Ribeiro, D.C., Freire, J., Vo, H.T., Silva, C.T.: An urban data profiler. In: WWW Workshop Data Science Smart Cities, pp. 1389–1394 (2015)
6. Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., Ives, Z.: DBpedia: a nucleus for a web of open data. In: Aberer, K., Choi, K.-S., Noy, N., Allemang, D., Lee, K.-I., Nixon, L., Golbeck, J., Mika, P., Maynard, D., Mizoguchi, R., Schreiber, G., Cudré-Mauroux, P. (eds.) ASWC/ISWC 2007. LNCS, vol. 4825, pp. 722–735. Springer, Heidelberg (2007). doi:10.1007/978-3-540-76298-0_52
7. Mendes, P.N., Jakob, M., García-Silva, A., Bizer, C.: DBpedia spotlight: shedding light on the web of documents. In: Proceedings of the 7th International Conference on Semantic Systems, pp. 1–8 (2011)
8. Janssen, M., Charalabidis, Y., Zuiderwijk, A.: Benefits, adoption barriers and myths of open data and open government. Inf. Syst. Manag. 29, 258–268 (2012)
9. Kassen, M.: A promising phenomenon of open data: a case study of the Chicago open data project. Gov. Inf. Q. 30, 508–513 (2013)
10. Nadeau, D.: A survey of named entity recognition and classification. Linguist. Investig. 30, 3–26 (2007)
11. Grishman, R.: Message Understanding Conference-6: a brief history. In: Proceedings of COLING 1996 (1996)
12. McCallum, A., Li, W.: Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. In: Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pp. 188–191. Association for Computational Linguistics, Morristown (2003)
13. Bikel, D.M., Miller, S., Schwartz, R., Weischedel, R.: Nymble. In: Proceedings of the Fifth Conference on Applied Natural Language Processing, pp. 194–201. Association for Computational Linguistics, Morristown (1997)
14. Borthwick, A., Sterling, J.: NYU: description of the MENE named entity system as used in MUC-7. In: Conference on MUC-7 (1998)
15. Asahara, M., Matsumoto, Y.: Japanese named entity extraction with redundant morphological analysis. In: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, NAACL 2003, pp. 8–15. Association for Computational Linguistics, Morristown (2003)
16. Hoffart, J., Yosef, M.A., Bordino, I., Fürstenau, H., Pinkal, M., Spaniol, M., Taneva, B., Thater, S., Weikum, G.: Robust disambiguation of named entities in text, pp. 782–792 (2011)
17. Ji, H., Grishman, R.: Data selection in semi-supervised learning for name tagging, pp. 48–55 (2006)
18. Alfonseca, E., Manandhar, S.: An unsupervised method for general named entity recognition and automated concept discovery. In: Conference on General WordNet (2002)
19. Ku, C.H., Iriberri, A., Leroy, G.: Natural language processing and e-government: crime information extraction from heterogeneous data sources. In: The Proceedings of the 9th Annual International Digital Government Research Conference. ACM International Conference Proceedings Series, pp. 162–170. ACM Press (2006)
20. Dalianis, H., Rosell, M., Sneiders, E.: Clustering e-mails for the Swedish social insurance agency – what part of the e-mail thread gives the best quality? In: Loftsson, H., Rögnvaldsson, E., Helgadóttir, S. (eds.) NLP 2010. LNCS, vol. 6233, pp. 115–120. Springer, Heidelberg (2010). doi:10.1007/978-3-642-14770-8_14
21. Amato, F., Mazzeo, A., Moscato, V., Picariello, A.: Semantic management of multimedia documents for e-government activity. In: 2009 International Conference on Complex, Intelligent and Software Intensive Systems, pp. 1193–1198 (2009)
22. Williams, L.M., Cody, S.A., Parnell, J.: Prospecting for new collaborations: mining syllabi for library service opportunities. J. Acad. Librariansh. 30, 270–275 (2004)
23. Basanya, R., Ojo, A., Janowski, T., Turini, F.: Mining collaboration opportunities to support joined-up government. In: Camarinha-Matos, L.M., Pereira-Klen, A., Afsarmanesh, H. (eds.) PRO-VE 2011. IAICT, vol. 362, pp. 359–366. Springer, Heidelberg (2011). doi:10.1007/978-3-642-23330-2_40
24. Wan, L., Chen, J., Gu, D.: An information mining model of intelligent collaboration based on agent technology. In: International Conference on Applied Sciences, Engineering and Technology, ICASET 2014. Scientific.net (2014)
25. Palmer, C., Harding, J.A., Swarnkar, R., Das, B.P., Young, R.I.M.: Generating rules from data mining for collaboration moderator services. J. Intell. Manuf. 24, 313–330 (2013)

Acknowledgments

This paper is partially supported by the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 645860, project ROUTE-TO-PA (Raising Open and User-friendly Transparency-Enabling Technologies for Public Administrations).

Author information

Authors and Affiliations

Insight Centre for Data Analytics, National University of Ireland Galway, Galway, Ireland
Mohamed Adel Rezk, Adegboyega Ojo & Islam A. Hassan

Corresponding author

Correspondence to Mohamed Adel Rezk.

Editor information

Editors and Affiliations

Universidade Nova de Lisboa, Monte Caparica, Portugal
Luis M. Camarinha-Matos
University of Amsterdam, Amsterdam, The Netherlands
Hamideh Afsarmanesh
ITIA-CNR, Milan, Italy
Rosanna Fornasiero

Cite this paper

Adel Rezk, M., Ojo, A., Hassan, I.A. (2017). Mining Governmental Collaboration Through Semantic Profiling of Open Data Catalogues and Publishers. In: Camarinha-Matos, L., Afsarmanesh, H., Fornasiero, R. (eds) Collaboration in a Data-Rich World. PRO-VE 2017. IFIP Advances in Information and Communication Technology, vol 506. Springer, Cham. https://doi.org/10.1007/978-3-319-65151-4_24

DOI: https://doi.org/10.1007/978-3-319-65151-4_24
Published: 22 August 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-65150-7
Online ISBN: 978-3-319-65151-4
eBook Packages: Computer Science, Computer Science (R0)
