1 Introduction

Despite the large volume of data published on open data portals worldwide (Footnote 1) [1, 2], methods to maximize stakeholders’ engagement and ease data integration are still incomplete [3,4,5]. We believe that collaboration channels, both within a single data portal and between multiple open data portals, have not yet been properly mined. Our work aims to develop a framework for mining collaboration channels in open data portals, as shown in Fig. 1. To achieve this, we start with data acquisition: we harvest the metadata of the datasets published on the portal, then restructure it and store it in MongoDB (Footnote 2). Afterwards, we construct a textual representation from the unstructured features of the dataset metadata and apply DBpedia Spotlight [7], the Named Entity Recognition pipeline of DBpedia [6], to extract information that represents those datasets and their publishers. We end up with a semantically enriched dataset upon which we can apply our profiling [5] and collaboration opportunity analysis. The paper is organized as follows: Sect. 2 presents background on Open Government Data, NLP and Collaboration Mining. Section 3 discusses our approach to the research question and its results. Section 4 describes applications, and Sect. 5 presents our conclusions and future plans.

Fig. 1. Open data catalogues and publishers’ semantic profiling conceptual framework.

2 Background and Related Work

This section defines the main concepts and reviews the literature in the related research areas: Open Government Data, NLP and Collaboration Mining.

2.1 Open Government Data

Open Government Data (OGD) refers to datasets generated and published by governmental departments “without any restrictions on its usage or distribution” and which do not contain any personal or undisclosed data [8]. OGD varies in multiple aspects, for example: (a) the domain of the publishing department or agency, e.g. agriculture, transport, environmental, financial and telecommunication data; (b) the data format, e.g. Excel, text, PDF, CSV. Theoretically, Government Open Data is operational or administrative governmental data available to use, redistribute, and analyze “in any form without any copyright restrictions” [9]. The open government working group draft of 2007 (Footnote 3) set out initial open data principles: data must be complete, primary, timely, accessible, machine-processable, nondiscriminatory, nonproprietary, and license-free. Further open data principles were added later: data must be online and free, permanent, trusted, assumed to be open, documented, safe to open, and designed with public input. Figure 2 shows the Irish government’s open data portal, which we used for our experiments (Footnote 4).

Fig. 2. Irish government’s open data portal.

2.2 Natural Language Processing

This section discusses the aspects of Natural Language Processing that are relevant to our research, specifically Named Entity Recognition (NER) and its applications:

2.2.1 Named Entity Recognition

Named Entity Recognition is the process of discovering Named Entities (NE) lying within a given text. A common definition of an NE is as follows [10]: “an information unit described by the name of a person or an organization, a location, a brand, a product, a numeric expression including time, date, money and percent found in a sentence.” [11]. NER applications are implemented using multiple methodologies:

Supervised Learning techniques use a large, manually categorized dataset to train the recognition algorithm. They apply Conditional Random Fields [12], Hidden Markov Models [13], Decision Trees [14], Support Vector Machines [15] and Maximum Entropy Models [16]. The objective of these methods is to identify and categorize related keywords. The unavailability of manually categorized datasets and the high cost of generating them represent a challenging obstacle to Supervised Learning techniques.

Semi-Supervised Learning and Unsupervised Learning techniques use either a small categorized dataset for training the algorithm [17] or a clustering-based algorithm. Other Unsupervised Learning techniques depend on lexical resources, e.g. WordNet, and on statistics to solve the NER task as a prediction problem [18].

2.3 Natural Language Processing in E-Government

There are few implementations of NLP technologies in the e-government area. Examples from the works we found include: a proposed application that gathers crime data from police departments and eyewitness stories and applies NLP technologies with GATE [19]; a system that imitates the email answering process automatically or semi-automatically using NLP technologies [20]; and an application that presents an original model for incorporating multimedia data to assist e-government tasks [21].

2.4 Mining for Collaboration

Mining and discovering collaboration opportunities offers considerable benefits, e.g. faster processes, standardization and integration. For this reason, the detection of possible collaboration opportunities within an organization or across multiple organizations and platforms is targeted in multiple domains. The few existing works on mining for collaboration include the following. In the library domain, possible collaboration opportunities with academic professionals are detected based on their publications in order to increase the benefits to students [22]. Collaboration mining between governmental levels and departments, based on their objectives, resources and services, aims to increase government efficiency in public policy development and implementation, crisis management, etc. [23]. A collaboration mining tool uses agent technology to analyze the collaboration between information on the web and to help its users find their desired materials more accurately and quickly [24]. Collaboration mining of team members uses summaries of successful past projects to increase moderator efficiency and to promote project partners’ awareness of the best way to formulate a proposal for a European research project [25].

3 Semantic Profiling for Collaboration Mining

As shown in Fig. 1 and in more detail in Fig. 3, we have designed a solution pipeline that incorporates Data Acquisition, Data Modeling, Data Analysis, and Data Visualization technologies to enable a collaboration mining tool. We start by specifying the targeted open data portal(s) in which we want to mine for collaborations, and then acquire the metadata (catalogue) of their datasets. Next, we restructure the catalogue to fit a predesigned storage model (the semantic profile). Within this model we enhance and filter catalogue features, exclude those that are less important for our use case, and add new features that correspond to our collaboration mining requirements. For example, we add a “textual representation” feature by merging the original textual features of the data catalogue; we add an “Entities” feature by applying NER over the new “textual representation” feature; we filter features like “author” and “creator” to end up with only a “publisher ID” feature; and we exclude the “groups” and “tracking summary” features. After constructing and storing the new data model (semantic profile), we run the unstructured data analysis (text mining) pipeline by applying the NER algorithm. At the end of that process we have a comparable feature, “Entities”, which is added to the new data model and used for collaboration mining. We then construct the data model (semantic profile) of each dataset publisher, which contains feature values aggregated from the datasets it has published. Finally, we compute relation strengths between dataset publishers by comparing their semantic profiles, which are built from the aggregation of the unique entities they publish datasets about, and store the result as shown in Fig. 4 for later visualization and web service usage, as shown in Figs. 10 and 11.
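The restructuring step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in our tool: the field names `title`, `notes`, `tags` and `publisher` are hypothetical stand-ins for a portal's metadata schema, and the recogniser is passed in as a plain function so that any NER backend could be plugged in.

```python
from typing import Callable, Dict, List

# Hypothetical field names; a real portal's metadata schema may differ.
TEXT_FEATURES = ("title", "notes", "tags")
EXCLUDED_FEATURES = ("groups", "tracking_summary")
AUTHOR_LIKE_FEATURES = ("author", "creator", "publisher")


def build_semantic_profile(record: Dict, recognise: Callable[[str], List[str]]) -> Dict:
    """Turn one raw catalogue record into a dataset semantic profile."""
    # Merge the unstructured textual features into a single string.
    parts: List[str] = []
    for name in TEXT_FEATURES:
        value = record.get(name)
        if isinstance(value, (list, tuple)):
            parts.extend(str(v) for v in value)
        elif value:
            parts.append(str(value))
    text = " ".join(parts)

    # Copy every feature except the excluded and the author-like ones.
    dropped = set(EXCLUDED_FEATURES) | set(AUTHOR_LIKE_FEATURES)
    profile = {k: v for k, v in record.items() if k not in dropped}

    # Author-like features collapse into a single publisher identifier.
    profile["publisher_id"] = record.get("publisher")
    profile["textual_representation"] = text
    # Keep unique entities only, sorted so that profiles are reproducible.
    profile["entities"] = sorted(set(recognise(text)))
    return profile


def toy_recogniser(text: str) -> List[str]:
    """Stand-in for a real NER call: matches words from a tiny vocabulary."""
    vocabulary = {"pollution", "hydrography"}
    return [word for word in text.lower().split() if word in vocabulary]


raw_record = {
    "id": "ds-1",
    "title": "Coastal pollution survey",
    "notes": "Hydrography and pollution readings",
    "tags": ["marine"],
    "author": "J. Doe",
    "creator": "J. Doe",
    "publisher": "marine-institute",
    "groups": ["environment"],
    "tracking_summary": {"total": 3},
}
profile = build_semantic_profile(raw_record, toy_recogniser)
```

Passing the recogniser as an argument keeps the storage model independent of the NER tool, which matters because the entity quality depends heavily on that tool.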

Fig. 3. Unstructured data analysis (Text Mining).

Fig. 4. Publisher collaboration network.

In the following, we present and discuss the results of our Semantic Profiling for Collaboration Mining approach.

3.1 Profiling the Catalogues

By querying the stored enriched metadata of an open data portal, we can generate charts that profile the underlying open data catalogue. As an example, we can retrieve the named entities that our tool detected by mining the unstructured textual representations of the data catalogues. Those named entities, which are derived from the dataset metadata, can, like the metadata itself, describe the contents of the data portal; see Figs. 5 and 6.
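A ranking such as the one behind the top named entities chart could be obtained with a simple frequency count. The sketch below works on in-memory dictionaries as a stand-in for a database query; the `entities` key and the function name are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def top_entities(profiles: Iterable[Dict], n: int = 10) -> List[Tuple[str, int]]:
    """Return the n entities that appear in the most dataset profiles."""
    counts: Counter = Counter()
    for profile in profiles:
        # set() ensures an entity is counted once per dataset.
        counts.update(set(profile.get("entities", [])))
    return counts.most_common(n)


sample_profiles = [
    {"entities": ["pollution", "hydrography"]},
    {"entities": ["pollution", "pollution"]},
    {"entities": []},
]
ranking = top_entities(sample_profiles, n=2)
```

Counting datasets rather than raw mentions keeps one verbose description from dominating the profile of the whole portal.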

Fig. 5. Top named entities describing the open data portal “data.gov.ie”.

Fig. 6. Top named entities types describing the open data portal “data.gov.ie”.

3.2 Publishers Profiles

Open data publishers are an interesting feature for open data analysis. Publishers can be governmental departments, councils, etc., which makes their profiles a key component of governmental data integration and standardization. An open data publisher’s profile is the aggregation of the information extracted from the metadata of its published datasets. One use of a publisher’s profile is to understand more about the domain of the publisher; see Fig. 7 for an example.
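The aggregation of dataset profiles into publisher profiles might look as follows. This is a minimal sketch that assumes each dataset profile carries a `publisher_id` and an `entities` list; both key names, and the choice to keep only a dataset count and an entity set, are illustrative.

```python
from collections import defaultdict
from typing import Dict, Iterable


def build_publisher_profiles(dataset_profiles: Iterable[Dict]) -> Dict[str, Dict]:
    """Aggregate dataset-level profiles into one profile per publisher."""
    publishers: Dict[str, Dict] = defaultdict(
        lambda: {"dataset_count": 0, "entities": set()}
    )
    for dataset in dataset_profiles:
        publisher = publishers[dataset["publisher_id"]]
        publisher["dataset_count"] += 1
        # A set keeps only the unique entities a publisher writes about.
        publisher["entities"].update(dataset.get("entities", []))
    return dict(publishers)


datasets = [
    {"publisher_id": "marine-institute", "entities": ["pollution", "tide"]},
    {"publisher_id": "marine-institute", "entities": ["pollution"]},
    {"publisher_id": "geological-survey-of-ireland", "entities": ["bedrock"]},
]
publisher_profiles = build_publisher_profiles(datasets)
```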

Fig. 7. Top named entities describing data posted by top publishers to the open data portal “data.gov.ie”.

3.3 Interlinking Publishers

The resulting publisher profiles are used to mine possible collaboration channels between data publishers, both within a data portal and across portals, by using the added comparable feature “Entities”; see Figs. 8, 9 and 10.

According to our results, “marine-institute” (129 datasets) and “geological-survey-of-ireland” (67 datasets) have the highest relation strength score, 82, which means that they share 82 entities/topics. We examined the datasets published by both publishers and found that, for the pollution concept/topic, there are 7 datasets published by “marine-institute” and 7 datasets published by “geological-survey-of-ireland”; similarly, for the hydrography concept/topic, there are 4 datasets published by “marine-institute” and 18 datasets published by “geological-survey-of-ireland”, as shown in Figs. 11 and 12.
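Since the relation strength is the number of entities two publishers have in common, it can be illustrated as the size of a set intersection. The sketch below is one possible reading under that assumption: entities are compared as exact strings, and the function name and `threshold` parameter are hypothetical. The entity sets are toy data, not our results.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def relation_strengths(
    publisher_entities: Dict[str, Set[str]], threshold: int = 0
) -> List[Tuple[str, str, int]]:
    """Score each publisher pair by the number of entities they share."""
    edges = []
    for first, second in combinations(sorted(publisher_entities), 2):
        shared = publisher_entities[first] & publisher_entities[second]
        # Keep only pairs whose strength exceeds the threshold.
        if len(shared) > threshold:
            edges.append((first, second, len(shared)))
    # Strongest relations first.
    edges.sort(key=lambda edge: edge[2], reverse=True)
    return edges


toy_entities = {
    "marine-institute": {"pollution", "hydrography", "tide"},
    "geological-survey-of-ireland": {"pollution", "hydrography", "bedrock"},
    "example-council": {"budget"},
}
network = relation_strengths(toy_entities)
```

Raising `threshold` prunes weak edges, in the same spirit as displaying only relations with a strength above 20.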

Fig. 8. Publishers collaboration network of open data portal “data.gov.ie” with relation strength >20.

Fig. 9. Publishers collaboration network of open data portal “data.gov.ie”, showing the highest relation strength score between “marine-institute” and “geological-survey-of-ireland”.

Fig. 10. Publishers mined relations of open data portal “data.gov.ie”.

Fig. 11. Datasets shared between Marine Institute and Geological Survey of Ireland around the concept pollution. (https://data.gov.ie/data/search?q=pollution&publisher=marine-institute) (https://data.gov.ie/data/search?q=pollution&publisher=geological-survey-of-ireland)

Fig. 12. Datasets shared between Marine Institute and Geological Survey of Ireland around the concept hydrography. (https://data.gov.ie/data/search?q=hydrography&publisher=marine-institute) (https://data.gov.ie/data/search?q=hydrography&publisher=geological-survey-of-ireland)

3.4 Limitations

The Named Entity Recognition part of the work is tightly coupled to the training and quality of the Named Entity Recognition algorithm. During this research we experimented with the Natural Language Toolkit (NLTK), Stanford NER, and Stanford NER with an n-gram (n = 3) enhancement. We ended up using DBpedia Spotlight as the NE source because, in our manual examination of the results of the text analysis phase, it outperformed the other methods in NE detection quality. DBpedia Spotlight still has its limitations, though, and we reported one of the issues we faced to its GitHub repository (Footnote 5).
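Because entity quality depends on the annotator, it can be useful to filter its output before it enters a profile. The sketch below parses an annotation result, assuming the JSON layout commonly returned by the DBpedia Spotlight annotate service, in which annotations sit under a `Resources` key with `@`-prefixed fields. That layout, the function name and the `min_similarity` cut-off are assumptions for illustration, and the sample payload is hand-written rather than real service output.

```python
from typing import Dict, List


def parse_spotlight_response(payload: Dict, min_similarity: float = 0.0) -> List[Dict]:
    """Extract entity URIs, surface forms and types from an annotation payload."""
    entities = []
    # Assumed behaviour: the "Resources" key is absent when nothing is annotated.
    for resource in payload.get("Resources", []):
        score = float(resource.get("@similarityScore", 0.0))
        if score < min_similarity:
            continue  # Drop low-confidence annotations.
        types = [t for t in resource.get("@types", "").split(",") if t]
        entities.append(
            {
                "uri": resource["@URI"],
                "surface_form": resource["@surfaceForm"],
                "types": types,
                "similarity": score,
            }
        )
    return entities


# Hand-written sample, not actual service output.
sample_payload = {
    "@text": "Pollution readings near Galway",
    "Resources": [
        {
            "@URI": "http://dbpedia.org/resource/Pollution",
            "@surfaceForm": "Pollution",
            "@types": "",
            "@similarityScore": "0.99",
        },
        {
            "@URI": "http://dbpedia.org/resource/Galway",
            "@surfaceForm": "Galway",
            "@types": "DBpedia:Place,DBpedia:Settlement",
            "@similarityScore": "0.40",
        },
    ],
}
confident = parse_spotlight_response(sample_payload, min_similarity=0.5)
```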

4 Applications

4.1 Standardization and Collaboration Analysis

Although most governments already publish their data via open data portals, when a government decides to integrate its data sources across its various departments and councils, this heterogeneous, domain-dependent data consumes substantial analysis resources and a considerable amount of time before it fits into an integrated data repository. Our profiling service helps data analysts define integration channels and the necessary concept standardizations between governmental departments and councils, using the data already published on open data portals. The same approach would fit a multinational enterprise as well.

For example, “marine-institute” and “geological-survey-of-ireland” share the named entity “pollution”. This concept should be standardized with regard to its code and its measurement unit to ease integration, comparability and analysis in general across multiple datasets.

4.2 Intelligent Open Data Portals Exploration

Open data portals are meant to face the public, in other words the citizens, but citizens cannot directly comprehend and consume this raw data [4]. The open data portal profiling service will help citizens explore the open data portal easily and intelligently, using visualized semantic profiles of publishers and datasets.

5 Conclusion and Future Work

Based on the results of our approach, we believe that we are on the right track to tackle the collaboration mining problem in the open governmental data domain: our pipeline produces interesting collaboration recommendations in a visualized form that is easy for general public users of open governmental data to comprehend.

Our future plan is to overcome the NE limitation by developing a new text analysis pipeline that integrates statistical text analysis, babel.net (Footnote 6), and DBpedia (Footnote 7) as our NE source. We also plan to replace the string comparison module with a semantic relatedness comparison module as the way of calculating relation strength between open governmental data publishers.