Abstract
With the increasing adoption of open data by governments worldwide, especially in the European Union, deeper analysis of newly published data is becoming a necessity. In addition to analyzing the published datasets themselves, we aim to analyze the published dataset catalogues. A dataset catalogue, or dataset metadata, contains features that describe in textual form what the data is about. We first acquire data from open data portals, choose descriptive catalogue features, and construct an aggregated textual representation of each dataset. We then enrich those textual representations using Natural Language Processing (NLP) methods to create a new comparable data feature, “Named Entities”. By mining this new feature we are able to produce relatedness networks of datasets and publishers. These networks are used to point out similarities between the data published across multiple open data portals. Pointing out all possible collaborations for integrating and standardizing data features and types would increase the value of data and ease its analysis.
1 Introduction
Despite the availability of data loaded into open data portals worldwide [1, 2], methods to maximize stakeholders’ engagement and ease data integration are still incomplete [3,4,5]. We believe that proper mining of collaboration channels, both within a single data portal and between multiple open data portals, has not yet been introduced. Our work aims to develop a framework for mining collaboration channels in open data portals, as shown in Fig. 1. To achieve this, we start with data acquisition by harvesting the metadata of the datasets published on the portal, then restructure it and store it in MongoDB. Afterwards we construct a textual representation from the unstructured features of the dataset metadata and apply DBpedia Spotlight [7], the Named Entity Recognition pipeline of DBpedia [6], to extract information that represents those datasets and their publishers. We end up with a semantically enriched dataset on which we can apply our profiling [5] and collaboration opportunity analysis. The paper is organized as follows: Sect. 2 presents background on Open Government Data, NLP and Collaboration Mining. Section 3 discusses our approach to the research question and its results. Section 4 presents applications of the approach, and Sect. 5 gives our conclusions and future plans.
2 Background and Related Work
The following subsections define the key concepts and review the literature in the related research areas: Open Government Data, NLP and Collaboration Mining.
2.1 Open Government Data
Open Government Data (OGD) refers to datasets generated and published by governmental departments “without any restrictions on its usage or distribution” that do not contain any personal or undisclosed data [8]. OGD varies in multiple aspects, for example: (a) the domain of the publishing department or agency, e.g. Agriculture Data, Transport Data, Environmental Data, Financial Data and Telecommunication Data; (b) the data format, e.g. Excel, Text, PDF, CSV. In theory, Open Government Data is operational or administrative governmental data that is available to use, redistribute, and analyze “in any form without any copyright restrictions” [9]. In its 2007 draft, the open government working group set out initial open data principles: data must be complete, primary, timely, accessible, machine-processable, nondiscriminatory, nonproprietary, and license-free. Further open data principles were added later: data must be online and free, permanent, trusted, assumed to be open, documented, safe to open, and designed with public input. Figure 2 shows the Irish government’s open data portal, which we used for our experiments.
2.2 Natural Language Processing
In this section we discuss the aspects of Natural Language Processing that are relevant to our research, specifically Named Entity Recognition (NER) and its applications:
2.2.1 Named Entity Recognition
Named Entity Recognition is the process of discovering the Named Entities (NE) lying within a given text. A common definition of an NE is [10]: “an information unit described by the name of a person or an organization, a location, a brand, a product, a numeric expression including time, date, money and percent found in a sentence.” [11]. NER applications are implemented using multiple methodologies:
Supervised Learning techniques use a large, manually categorized dataset to train the recognition algorithm. They apply Conditional Random Fields [12], Hidden Markov Models [13], Decision Trees [14], Support Vector Machines [15] and Maximum Entropy Models [16]. The objective of these methods is to identify and categorize related keywords. The scarcity of manually categorized datasets and the high cost of generating them are a major obstacle for Supervised Learning techniques.
Semi-Supervised and Unsupervised Learning techniques use either a small categorized dataset for training the algorithm [17] or a clustering-based algorithm. Other Unsupervised Learning techniques depend on lexical resources, e.g. WordNet, and on statistics to solve the NER task as a prediction problem [18].
2.3 Natural Language Processing in E-Government
There are few implementations of NLP technologies in the e-government area. Examples include: an application that gathers crime data from police departments and eyewitness narratives and applies NLP technologies with GATE [19]; a system that automates or semi-automates the email answering process using NLP technologies [20]; and an application that presents an original model for incorporating multimedia data to assist e-government tasks [21].
2.4 Mining for Collaboration
Because of the benefits of mining and discovering collaboration opportunities, e.g. faster processes, standardization and integration, the detection of possible collaboration opportunities within an organization or across multiple organizations and platforms is pursued in multiple domains. The few existing works on mining for collaboration include the following. In the library domain, possible collaboration opportunities with academic professionals are detected based on their publications, to increase the benefit to students [22]. Collaboration mining between governmental levels and departments, based on their objectives, resources and services, aims to increase government efficiency in public policy development and implementation, crisis management, etc. [23]. A collaboration mining tool uses agent technology to analyze the collaboration between information on the web, helping its users to find the material they want faster and more accurately [24]. Collaboration mining of team members uses summaries of successful past projects to increase moderator efficiency and to promote project partners’ awareness of the best way to formulate a proposal for a European research project [25].
3 Semantic Profiling for Collaboration Mining
As shown in Fig. 1 and in more detail in Fig. 3, we have designed a solution pipeline that incorporates Data Acquisition, Data Modeling, Data Analysis, and Data Visualization technologies to enable a collaboration mining tool. We start by specifying the targeted open data portal(s) in which we want to mine for collaborations, and then acquire the metadata (catalogue) of the datasets. Next we restructure the catalogue to fit the predesigned storage model (semantic profile). Within this model we enhance, filter and exclude catalogue features that are less important for our use case, and we add new features that correspond to our collaboration mining requirements. For example, we add a “textual representation” feature by merging the original textual features of the data catalogue; we add an “Entities” feature by applying NER to the new “textual representation” feature; we filter features like “author” and “creator” to end up with only a “publisher ID” feature; and we exclude the “groups” and “tracking summary” features. After constructing and storing the new data model (semantic profile), we start the unstructured data analysis (text mining) pipeline by applying the NER algorithm. At the end of that process we generate the comparable feature “Entities” and add it to the new data model to be used for collaboration mining. We then construct the data model (semantic profile) of each dataset publisher, which contains aggregated feature values from its published datasets. Finally, we compute relation strengths between dataset publishers by comparing their semantic profiles, which are built from the aggregation of the unique entities they publish datasets about, and store the result as shown in Fig. 4 for later visualization and web service usage, as shown in Figs. 10 and 11.
In the following we discuss and present the results of our Semantic Profiling for Collaboration Mining approach.
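The catalogue restructuring step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the field names (`title`, `notes`, `tags`, `publisher_id`) and the functions `build_textual_representation`, `toy_entity_recognizer` and `build_semantic_profile` are hypothetical, and the toy recognizer only stands in for the real NER step (DBpedia Spotlight in the paper), which is not called here.

```python
from typing import Callable, Dict, List

# Assumed names of the textual catalogue features to merge (hypothetical).
TEXT_FIELDS = ("title", "notes", "tags")


def build_textual_representation(record: Dict) -> str:
    """Merge the textual catalogue features of one dataset into a single string."""
    parts: List[str] = []
    for field in TEXT_FIELDS:
        value = record.get(field)
        if isinstance(value, str):
            parts.append(value.strip())
        elif isinstance(value, (list, tuple)):
            parts.extend(str(item).strip() for item in value)
    return " ".join(part for part in parts if part)


def toy_entity_recognizer(text: str) -> List[str]:
    """Stand-in for the NER step: matches a tiny hard-coded vocabulary."""
    vocabulary = ("hydrography", "pollution", "seabed")
    lowered = text.lower()
    return [term for term in vocabulary if term in lowered]


def build_semantic_profile(
    record: Dict, recognize: Callable[[str], List[str]] = toy_entity_recognizer
) -> Dict:
    """Keep only the features needed for collaboration mining."""
    text = build_textual_representation(record)
    # Only the keys listed below are copied, so features such as "author",
    # "creator", "groups" and "tracking_summary" are dropped.
    return {
        "dataset_id": record["id"],
        "publisher_id": record["publisher_id"],
        "textual_representation": text,
        "entities": recognize(text),
    }


# Illustrative record; the values are made up for this sketch.
raw_record = {
    "id": "ds-1",
    "publisher_id": "publisher-a",
    "title": "Seabed survey",
    "notes": "Hydrography and pollution measurements.",
    "tags": ["marine"],
    "author": "someone",
    "groups": [],
    "tracking_summary": {},
}
profile = build_semantic_profile(raw_record)
print(profile["entities"])
```

Passing the recognizer as a parameter keeps the profile construction independent of the NER source, so a different recognizer could be substituted without touching the rest of the sketch.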
3.1 Profiling the Catalogues
By querying the stored, enriched metadata of an open data portal we are able to generate charts that profile the underlying open data catalogue. For example, we can retrieve the named entities that our tool detected by mining the unstructured textual representations of the data catalogues. Those named entities, which are derived from the dataset metadata, are, like their source, able to describe the contents of the data portals; see Figs. 5 and 6.
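One way to picture such a profiling query is a frequency count of entities over the enriched dataset records. The sketch below works on an in-memory list rather than the MongoDB store used in the paper; the record layout and the function name `entity_frequencies` are assumptions made for illustration, and the sample values are invented.

```python
from collections import Counter
from typing import Dict, Iterable

# Illustrative enriched records; the values are made up for this sketch.
dataset_profiles = [
    {"dataset_id": "ds-1", "entities": ["pollution", "hydrography"]},
    {"dataset_id": "ds-2", "entities": ["pollution"]},
    {"dataset_id": "ds-3", "entities": ["transport", "pollution", "pollution"]},
]


def entity_frequencies(profiles: Iterable[Dict]) -> Counter:
    """Count in how many datasets each named entity occurs."""
    counts: Counter = Counter()
    for profile in profiles:
        # A set is used so an entity repeated inside one dataset counts once.
        counts.update(set(profile.get("entities", [])))
    return counts


frequencies = entity_frequencies(dataset_profiles)
most_frequent_entity, dataset_count = frequencies.most_common(1)[0]
print(most_frequent_entity, dataset_count)
```

Counts of this kind are what a chart of the most frequent entities in a portal would be drawn from.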
3.2 Publishers Profiles
Open data publishers are an interesting feature for open data analysis. Publishers can be governmental departments, councils, etc., which makes their profiles a key component of governmental data integration and standardization. An open data publisher’s profile is the aggregation of the information extracted from the metadata of its published datasets. One use of a publisher’s profile is to understand more about the publisher’s domain; see Fig. 7 for an example.
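The aggregation behind a publisher profile can be sketched as a group-by over dataset profiles. The record layout, the profile keys (`dataset_count`, `entities`) and the function name `build_publisher_profiles` are hypothetical, and the sample records are invented; the paper does not specify the exact structure of its stored profiles.

```python
from collections import defaultdict
from typing import Dict, Iterable

# Illustrative dataset profiles; the values are made up for this sketch.
dataset_profiles = [
    {"publisher_id": "publisher-a", "entities": ["pollution", "seabed"]},
    {"publisher_id": "publisher-a", "entities": ["pollution", "hydrography"]},
    {"publisher_id": "publisher-b", "entities": ["transport"]},
]


def build_publisher_profiles(profiles: Iterable[Dict]) -> Dict[str, Dict]:
    """Aggregate dataset-level entities into one profile per publisher."""
    publishers: Dict[str, Dict] = defaultdict(
        lambda: {"dataset_count": 0, "entities": set()}
    )
    for profile in profiles:
        publisher = publishers[profile["publisher_id"]]
        publisher["dataset_count"] += 1
        # The union keeps each entity once, however many datasets mention it.
        publisher["entities"].update(profile["entities"])
    return dict(publishers)


publisher_profiles = build_publisher_profiles(dataset_profiles)
print(sorted(publisher_profiles["publisher-a"]["entities"]))
```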
3.3 Interlinking Publishers
The resulting publisher profiles are used to mine possible collaboration channels between data publishers, both within a data portal and across portals, using the added comparable feature “Entities”; see Figs. 8, 9 and 10.
According to our results, “marine-institute” (129 datasets) and “geological-survey-of-ireland” (67 datasets) have the highest relation strength score, 82, which means that they have 82 entities/topics in common. We examined the datasets published by both publishers and found that for the pollution concept/topic there are 7 datasets published by “marine-institute” and 7 datasets published by “geological-survey-of-ireland”, and similarly for the hydrography concept/topic there are 4 datasets published by “marine-institute” and 18 datasets published by “geological-survey-of-ireland”, as shown in Figs. 11 and 12.
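Since the relation strength is described as the number of entities two publishers have in common, it can be sketched as the size of a set intersection, computed for every pair of publishers. The function names `relation_strength` and `rank_publisher_pairs` are hypothetical, the publisher names and entity sets are invented toy data rather than the paper's results, and exact string matching of entity labels is assumed.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def relation_strength(entities_a: Set[str], entities_b: Set[str]) -> int:
    """Number of unique entities that two publisher profiles share."""
    return len(entities_a & entities_b)


def rank_publisher_pairs(
    profiles: Dict[str, Set[str]]
) -> List[Tuple[str, str, int]]:
    """Score every unordered publisher pair, strongest relation first."""
    scored = [
        (name_a, name_b, relation_strength(profiles[name_a], profiles[name_b]))
        for name_a, name_b in combinations(sorted(profiles), 2)
    ]
    return sorted(scored, key=lambda row: row[2], reverse=True)


# Toy publisher profiles; not taken from the portal analysed in the paper.
publisher_entities = {
    "publisher-a": {"pollution", "hydrography", "seabed"},
    "publisher-b": {"pollution", "hydrography", "geology"},
    "publisher-c": {"transport", "pollution"},
}
ranking = rank_publisher_pairs(publisher_entities)
for name_a, name_b, score in ranking:
    print(name_a, name_b, score)
```

The resulting triples can be read as the weighted edges of a publisher relatedness network.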
3.4 Limitations
The Named Entity Recognition part of the work is tightly coupled to the training and quality of the NER algorithm. In this research we experimented with the Natural Language Toolkit (NLTK), Stanford NER, and Stanford NER enhanced with n-grams of size 3. We ended up using DBpedia Spotlight as the NE source because, in our manual examination of the results of the text analysis phase, it outperformed the other methods in NE detection quality. DBpedia Spotlight still has its limitations, though, and we reported one of the issues we faced to its GitHub repository.
4 Applications
4.1 Standardization and Collaboration Analysis
Although most governments already publish their data via open data portals, when a government decides to integrate its data sources across its various departments and councils, this heterogeneous, domain-dependent data will consume huge analysis resources and a considerable amount of time before it fits into an integrated data repository. Our profiling service will help data analysts to define integration channels and the necessary concept standardizations between governmental departments and councils, using the available data published on open data portals. The same applies to a multinational enterprise.
For example, “marine-institute” and “geological-survey-of-ireland” share the named entity “pollution”; this concept should be standardized with respect to its code and measurement unit to ease integration, comparability, and analysis in general across multiple datasets.
4.2 Intelligent Open Data Portals Exploration
Open data portals are meant to face the public, in other words the citizens, but citizens cannot directly comprehend and consume this raw data [4]. An open data portal profiling service will help citizens to explore the portal easily and intelligently using visualized semantic profiles of publishers and datasets.
5 Conclusion and Future Work
Based on the results of our approach, we believe that we are on the right track to tackle the collaboration mining problem in the open government data domain, as our pipeline produces interesting collaboration recommendations in a visualized form that is easy for general public users of open government data to comprehend.
Our future plan is to overcome the NE limitation by developing a new text analysis pipeline that integrates statistical text analysis, babel.net, and DBpedia as our NE sources. We also plan to replace the string comparison module with a semantic relatedness comparison module for calculating the relation strength between open government data publishers.
References
1. Shadbolt, N., O’Hara, K., Berners-Lee, T., Gibbins, N., Glaser, H., Hall, W., Schraefel, M.C.: Linked open government data: lessons from data.gov.uk. IEEE Intell. Syst. 27, 16–24 (2012)
2. Breitman, K., Salas, P., Casanova, M.A., Saraiva, D.: Open government data in Brazil. IEEE Intell. Syst. 27, 45–49 (2012)
3. Mutuku, L.N., Colaco, J.: Increasing Kenyan open data consumption. In: Proceedings of 6th International Conference on Theory and Practice of Electronic Governance, ICEGOV 2012, p. 18 (2012)
4. Artigas, F., Chun, S.A.: Visual analytics for open government data. In: 14th Annual International Conference on Digital Government Research, From E-Government to Smart Gov., dg.o 2013, pp. 298–299 (2013)
5. Ribeiro, D.C., Freire, J., Vo, H.T., Silva, C.T.: An urban data profiler. In: WWW Workshop Data Science Smart Cities, pp. 1389–1394 (2015)
6. Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., Ives, Z.: DBpedia: a nucleus for a web of open data. In: Aberer, K., Choi, K.-S., Noy, N., Allemang, D., Lee, K.-I., Nixon, L., Golbeck, J., Mika, P., Maynard, D., Mizoguchi, R., Schreiber, G., Cudré-Mauroux, P. (eds.) ASWC/ISWC 2007. LNCS, vol. 4825, pp. 722–735. Springer, Heidelberg (2007). doi:10.1007/978-3-540-76298-0_52
7. Mendes, P.N., Jakob, M., García-Silva, A., Bizer, C.: DBpedia spotlight: shedding light on the web of documents. In: Proceedings of the 7th International Conference on Semantic Systems, pp. 1–8 (2011)
8. Janssen, M., Charalabidis, Y., Zuiderwijk, A.: Benefits, adoption barriers and myths of open data and open government. Inf. Syst. Manag. 29, 258–268 (2012)
9. Kassen, M.: A promising phenomenon of open data: a case study of the Chicago open data project. Gov. Inf. Q. 30, 508–513 (2013)
10. Nadeau, D.: A survey of named entity recognition and classification. Linguist. Investig. 30, 3–26 (2007)
11. Grishman, R.: Message Understanding Conference-6: a brief history. In: Proceedings of COLING 1996 (1996)
12. McCallum, A., Li, W.: Early results for named entity recognition with conditional random fields, feature induction and web-enhanced lexicons. In: Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pp. 188–191. Association for Computational Linguistics, Morristown (2003)
13. Bikel, D.M., Miller, S., Schwartz, R., Weischedel, R.: Nymble. In: Proceedings of the Fifth Conference on Applied Natural Language Processing, pp. 194–201. Association for Computational Linguistics, Morristown (1997)
14. Borthwick, A., Sterling, J.: NYU: description of the MENE named entity system as used in MUC-7. In: Conference on MUC-7 (1998)
15. Asahara, M., Matsumoto, Y.: Japanese named entity extraction with redundant morphological analysis. In: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, NAACL 2003, pp. 8–15. Association for Computational Linguistics, Morristown (2003)
16. Hoffart, J., Yosef, M.A., Bordino, I., Fürstenau, H., Pinkal, M., Spaniol, M., Taneva, B., Thater, S., Weikum, G.: Robust disambiguation of named entities in text, pp. 782–792 (2011)
17. Ji, H., Grishman, R.: Data selection in semi-supervised learning for name tagging, pp. 48–55 (2006)
18. Alfonseca, E., Manandhar, S.: An unsupervised method for general named entity recognition and automated concept discovery. In: Conference on General WordNet (2002)
19. Ku, C.H., Iriberri, A., Leroy, G.: Natural language processing and e-government: crime information extraction from heterogeneous data sources. In: The Proceedings of the 9th Annual International Digital Government Research Conference. ACM International Conference Proceedings Series, pp. 162–170. ACM Press (2006)
20. Dalianis, H., Rosell, M., Sneiders, E.: Clustering e-mails for the Swedish social insurance agency – what part of the e-mail thread gives the best quality? In: Loftsson, H., Rögnvaldsson, E., Helgadóttir, S. (eds.) NLP 2010. LNCS, vol. 6233, pp. 115–120. Springer, Heidelberg (2010). doi:10.1007/978-3-642-14770-8_14
21. Amato, F., Mazzeo, A., Moscato, V., Picariello, A.: Semantic management of multimedia documents for e-government activity. In: 2009 International Conference on Complex, Intelligent and Software Intensive Systems, pp. 1193–1198 (2009)
22. Williams, L.M., Cody, S.A., Parnell, J.: Prospecting for new collaborations: mining syllabi for library service opportunities. J. Acad. Librariansh. 30, 270–275 (2004)
23. Basanya, R., Ojo, A., Janowski, T., Turini, F.: Mining collaboration opportunities to support joined-up government. In: Camarinha-Matos, L.M., Pereira-Klen, A., Afsarmanesh, H. (eds.) PRO-VE 2011. IAICT, vol. 362, pp. 359–366. Springer, Heidelberg (2011). doi:10.1007/978-3-642-23330-2_40
24. Wan, L., Chen, J., Gu, D.: An information mining model of intelligent collaboration based on agent technology. In: International Conference on Applied Sciences, Engineering and Technology, ICASET 2014. Scientific.net (2014)
25. Palmer, C., Harding, J.A., Swarnkar, R., Das, B.P., Young, R.I.M.: Generating rules from data mining for collaboration moderator services. J. Intell. Manuf. 24, 313–330 (2013)
Acknowledgments
This paper is partially supported by the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 645860, project ROUTE-TO-PA (Raising Open and User-friendly Transparency-Enabling Technologies for Public Administrations).
© 2017 IFIP International Federation for Information Processing
Adel Rezk, M., Ojo, A., Hassan, I.A. (2017). Mining Governmental Collaboration Through Semantic Profiling of Open Data Catalogues and Publishers. In: Camarinha-Matos, L., Afsarmanesh, H., Fornasiero, R. (eds) Collaboration in a Data-Rich World. PRO-VE 2017. IFIP Advances in Information and Communication Technology, vol 506. Springer, Cham. https://doi.org/10.1007/978-3-319-65151-4_24
DOI: https://doi.org/10.1007/978-3-319-65151-4_24
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-65150-7
Online ISBN: 978-3-319-65151-4
eBook Packages: Computer Science, Computer Science (R0)