1 Introduction

Prior studies [1, 2] highlight the benefits of employing Linked Data for investment analysis by combining information from DBpedia, stock market patterns and different taxonomy versions of companies’ accounting statements. Increasingly, investors and regulators are demanding that companies disclose non-financial information, particularly firms’ impacts on the environment (referred to as sustainability) [4]. The voluntary nature of corporate sustainability reporting has resulted in the publication of inconsistent and incomplete information [4], which has inhibited the manual creation of ontologies [3, 8]. In this study, we employ an automated ontology learning system to overcome this challenge. The proposed system, labelled SPARQL LDA, employs Latent Dirichlet Allocation (LDA) [5] for the discovery of topics to represent ontology concepts [6–8].

The system works in three phases. The first phase employs a Naïve Bayesian model to categorize text in sustainability disclosures. The model detects text related to a firm’s climate change impacts and aggregates the matching sentences to create a composite document. The second phase employs an LDA topic model to detect contextual information in text. Topics are learned by retrieving terms via a SPARQL endpoint and seeding them as lexical priors into the LDA model. The final phase combines the LDA topic probabilities in an ensemble model and classifies the quality of corporate sustainability reporting using publicly available disclosure ratings.

The rest of this study is structured as follows: Sect. 2 provides a brief overview of relevant sustainability datasets. In Sect. 3 we develop a system to evaluate the quality of corporate sustainability disclosures. Section 4 provides an empirical evaluation of the proposed system. We conclude in Sect. 5.

2 Environmental Sustainability Datasets

Prior earth science literature has explored the benefits of incorporating Semantic Web technologies to predict the impacts of climate change [9–12]. To our knowledge, the literature has not considered the implications for companies or government regulatory policy. To aid such analysis we highlight two publicly available datasets. The US Global Change Research Act of 1990 requires a National Climate Assessment (NCA) report [13] on the impact of climate change and affected industries. This includes a Global Change Information System (GCIS) which stores climate change metadata. GCIS resources are exported into a triple store queryable through a public SPARQL interface. A second dataset, published under the “Key stats and ratios” section of Google Finance, provides ratings to evaluate the quality of firms’ sustainability disclosures. These ratings are collected by the Carbon Disclosure Project (CDP), an initiative led by the United Nations, and are computed from annual surveys of domain experts. The highest CDP rating, ‘A’, corresponds to companies that are perceived to have published comprehensive climate change disclosures. The lowest rating, ‘E’, corresponds to companies with poor quality disclosures.

3 Model of Corporate Sustainability

3.1 Climate Change Aspect Detection

The first phase of the system employs a Naïve Bayesian classifier to detect salient aspects in text. A pre-processing step selects classification features from Wikipedia’s ‘Carbon emissions reporting’ page. The page provides an overview of corporate environmental reporting issues. We select the 10 most frequently occurring unigrams and bigrams as classification features: “climate”, “climate change”, “emissions”, “emitters”, “gas”, “ghg”, “greenhouse”, “scope 1”, “scope 2”, “scope 3”.
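The aspect-detection step can be illustrated with a minimal sketch. It assumes a Bernoulli Naïve Bayes model over binary presence of the ten features, add-one smoothing, and a handful of hand-labelled sentences; the text does not state which Naïve Bayes variant or training data were used, so the class `BernoulliNB`, the `TRAIN` sentences and the helper names are all hypothetical.

```python
import math
import re

# The ten unigram/bigram features listed in the text.
FEATURES = ["climate", "climate change", "emissions", "emitters", "gas",
            "ghg", "greenhouse", "scope 1", "scope 2", "scope 3"]


def featurize(sentence):
    """Binary vector: 1 if the feature occurs as a whole word or phrase."""
    text = sentence.lower()
    return [1 if re.search(r"\b" + re.escape(f) + r"\b", text) else 0
            for f in FEATURES]


class BernoulliNB:
    """Bernoulli Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, vectors, labels):
        self.classes = sorted(set(labels))
        n = len(labels)
        self.log_prior = {}
        self.p_feature = {}
        for c in self.classes:
            rows = [v for v, y in zip(vectors, labels) if y == c]
            self.log_prior[c] = math.log(len(rows) / n)
            # Smoothed probability that each feature is present in class c.
            self.p_feature[c] = [
                (sum(r[j] for r in rows) + 1) / (len(rows) + 2)
                for j in range(len(FEATURES))
            ]
        return self

    def log_score(self, vector, c):
        score = self.log_prior[c]
        for x, p in zip(vector, self.p_feature[c]):
            score += math.log(p if x else 1.0 - p)
        return score

    def predict(self, vector):
        return max(self.classes, key=lambda c: self.log_score(vector, c))


# Hypothetical labelled sentences (1 = climate-related, 0 = other).
TRAIN = [
    ("Our scope 1 and scope 2 emissions fell this year.", 1),
    ("We report greenhouse gas emissions annually.", 1),
    ("Climate change poses risks to our supply chain.", 1),
    ("The board approved a new dividend policy.", 0),
    ("Employee training hours increased.", 0),
    ("Revenue grew in the retail segment.", 0),
]

model = BernoulliNB().fit([featurize(s) for s, _ in TRAIN],
                          [y for _, y in TRAIN])


def composite_document(sentences):
    """Join the sentences classified as climate-related."""
    return " ".join(s for s in sentences if model.predict(featurize(s)) == 1)
```

Calling `composite_document` on the sentences of one report yields the aggregated climate-related text that the later phases consume.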

3.2 LDA Topic Model

The second phase of the system employs an LDA model [5] for the discovery of topics represented as ontology concepts [6–8, 16]. In LDA, a topic is modeled as a probability distribution over a set of words represented by a vocabulary, and a document as a probability distribution over a set of topics. Our approach departs from a traditional LDA model [5] by seeding terms as lexical priors following the approach of [14]. Figure 1 displays the SPARQL query which retrieves the key recommendations from the latest NCA report using the GCIS interface (see Sect. 2). The unique terms (excluding stopwords) generated by the query form the LDA model’s vocabulary.

Fig. 1. SPARQL query used to retrieve earth science terms
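As a rough illustration of how query output might be turned into a vocabulary, the sketch below parses a response in the W3C SPARQL 1.1 Query Results JSON Format and keeps the unique non-stopword terms. The variable name `statement`, the sample response and the short stopword list are placeholders, not actual GCIS output or the query from Fig. 1.

```python
import json
import re

# Small illustrative stopword list; a real run would use a fuller one.
STOPWORDS = {"a", "an", "and", "are", "as", "by", "for", "in", "is", "of",
             "on", "the", "to", "will", "with"}


def vocabulary_from_results(results_json, var):
    """Collect unique non-stopword terms from one variable of a
    SPARQL JSON result set."""
    data = json.loads(results_json)
    vocab = set()
    for binding in data["results"]["bindings"]:
        if var not in binding:
            continue  # variable is unbound in this row
        for token in re.findall(r"[a-z]+", binding[var]["value"].lower()):
            if token not in STOPWORDS:
                vocab.add(token)
    return sorted(vocab)


# Hypothetical response; the variable name and the text are placeholders.
SAMPLE = json.dumps({
    "head": {"vars": ["statement"]},
    "results": {"bindings": [
        {"statement": {"type": "literal",
                       "value": "Sea level rise will increase flooding."}},
        {"statement": {"type": "literal",
                       "value": "Heat waves are increasing in frequency."}},
    ]},
})

vocab = vocabulary_from_results(SAMPLE, "statement")
```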

We implement standard settings for LDA hyperparameters with α = 50/K and β = 0.01 [15]. The number of topics, K, is set to five following a heuristic approach based on the number of climate change topics reported in the latest NCA report [13]. Table 1 displays the top terms associated with the topic clusters. Cluster labels are manually annotated to aid the reader’s interpretation.
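To make these settings concrete, the following is a minimal collapsed Gibbs sampler for LDA, written as a sketch and not as the implementation used in the study. It assumes symmetric priors with α = 50/K and β = 0.01, and it reads "seeding" in the simplest way supported by the text: tokens outside the supplied vocabulary are dropped. How [14] encodes lexical priors may differ; the function name `gibbs_lda` and the toy documents are hypothetical.

```python
import random


def gibbs_lda(docs, vocab, K=5, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA over a fixed vocabulary.

    docs  : list of token lists
    vocab : list of allowed terms; tokens outside it are dropped
    Returns (theta, phi): document-topic and topic-word distributions.
    """
    rng = random.Random(seed)
    alpha = 50.0 / K                      # setting quoted in the text
    index = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    corpus = [[index[w] for w in d if w in index] for d in docs]

    n_dk = [[0] * K for _ in corpus]      # topic counts per document
    n_kw = [[0] * V for _ in range(K)]    # word counts per topic
    n_k = [0] * K                         # total tokens per topic
    z = []                                # topic assignment per token
    for d, doc in enumerate(corpus):
        z_d = []
        for w in doc:
            k = rng.randrange(K)          # random initial assignment
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
        z.append(z_d)

    for _ in range(iters):
        for d, doc in enumerate(corpus):
            for i, w in enumerate(doc):
                k = z[d][i]
                # Remove the token's current assignment from the counts.
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # Unnormalised full conditional p(z = t | all other z).
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][w] + beta)
                    / (n_k[t] + V * beta)
                    for t in range(K)
                ]
                k = rng.choices(range(K), weights=weights)[0]
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1

    theta = [
        [(n_dk[d][k] + alpha) / (len(doc) + K * alpha) for k in range(K)]
        for d, doc in enumerate(corpus)
    ]
    phi = [
        [(n_kw[k][w] + beta) / (n_k[k] + V * beta) for w in range(V)]
        for k in range(K)
    ]
    return theta, phi


# Toy usage: "profit" is outside the vocabulary and is ignored.
docs = [["sea", "level", "rise", "profit"], ["heat", "waves", "heat"]]
vocab = ["heat", "level", "rise", "sea", "waves"]
theta, phi = gibbs_lda(docs, vocab, K=5, iters=50)
```

With K = 5 the document prior is α = 10, which smooths each row of `theta` strongly towards uniform for short documents.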

Table 1. Topic clusters and top words identified by SPARQL LDA

The outcome of the model is a finer-grained categorization of companies’ disclosures based on topics discussed by the online scientific community. The probabilities associated with each topic cluster are included as components within the ensemble tree.

4 Ensemble Model

In this section we outline the evaluation of the ensemble classification tree, present the results and briefly conclude.

4.1 Data

Sustainability disclosures are reported annually on company websites. We retrieve a sample of 443 reports via the Google search query: “sustainability report type:pdf site:” followed by companies’ URLs obtained from DBpedia (dbpedia-owl:wikiPageExternalLink). Document text is extracted using PDFMiner. To evaluate the ensemble tree’s classifications, we create a Boolean variable which takes a value of one if a company’s CDP disclosure is ‘A’ rated and zero otherwise (see Sect. 2).
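The query construction and the target variable can be sketched as follows. The company records are placeholders, and the helper names `search_query` and `is_a_rated` are introduced only for illustration.

```python
def search_query(company_url):
    """Build the search string quoted in the text for one company URL."""
    return "sustainability report type:pdf site:" + company_url


def is_a_rated(cdp_rating):
    """Return 1 if the CDP disclosure rating is 'A', else 0."""
    return 1 if cdp_rating.strip().upper() == "A" else 0


# Hypothetical company records (URLs and ratings are placeholders).
companies = [
    {"url": "example.com", "cdp": "A"},
    {"url": "example.org", "cdp": "C"},
]
queries = [search_query(c["url"]) for c in companies]
labels = [is_a_rated(c["cdp"]) for c in companies]
```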

4.2 Experimental Setup

We design the evaluation by comparing two systems. The benchmark employs a traditional LDA model and infers topics using only the underlying collection of documents. The SPARQL LDA system incorporates lexical priors by seeding the SPARQL-generated vocabulary (see Sect. 3.2). Any differences in classification between the two systems can be explained by the different approaches to topic learning. Experiments were validated using 10-fold cross-validation. The performance is evaluated in terms of Precision, Recall, and F1-measure:

$$ recall = \frac{TP}{TP + FN}\quad precision = \frac{TP}{TP + FP}\quad F1\;measure = \frac{2 \cdot precision \cdot recall}{precision + recall} $$
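For reference, these metrics can be computed from binary labels as in the sketch below. It uses the standard F1 definition, the harmonic mean of precision and recall, which carries a factor of 2 in the numerator. The label vectors are made up for illustration and are not results from the study.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for binary labels (1 = positive class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Guard against empty denominators by returning 0.0 in those cases.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Illustrative labels only: 2 true positives, 1 false positive,
# 1 false negative.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```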

The evaluation metrics are shown in Table 2.

Table 2. System evaluation

Precision for the SPARQL LDA system improves by 33% versus the traditional LDA approach.

5 Conclusion

The manual building of ontologies is a time-consuming and costly process, particularly in fast-evolving domains of knowledge such as earth science, where information is updated often. In this paper we employ a fully automated method for learning ontologies to alleviate the need for manual approaches. Our findings point to the benefits of integrating Linked Data for investors’ analyses of both financial and non-financial disclosures.