Scientific Datasets: Informetric Characteristics and Social Utility Metrics for Biodiversity Data Sources

Ingwersen, Peter

doi:10.1007/978-3-642-54812-3_8

Scientific Datasets: Informetric Characteristics and Social Utility Metrics for Biodiversity Data Sources

Peter Ingwersen³

Chapter
Open Access
First Online: 01 January 2014

28k Accesses
2 Citations

Abstract

The contribution places biodiversity datasets in relation to other central elements of the modern scientific communication system and defines quantitative analyses of metadata of such datasets as belonging to the intersection of Scientometrics and Webometrics. The analyses show that rank distributions of social utility evidence, such as search events and retrieved and viewed dataset records over a given range of datasets follow power law characteristics. A variety of dataset usage index (DUI) metrics is exemplified and illustrated by dataset indicators from three large, medium and small US and Danish dataset providers observed over a one-year period and compared to recent developments. Metrics discussed are of absolute as well as relative nature and include popularity, social attractiveness, and usage and interest impact scores.

Peter Ingwersen, Professor Emeritus, Royal School of Information and Library Science University of, Copenhagen, Denmark. Research areas: Interactive IR evaluation and theory; Webometrics-Scientometrics; Research evaluation. Awarded the Derek de Solla Price Medal, 2005 and the ASIST Research Award, 2003. He has initiated the CoLIS and IIIX conferences and published several highly cited research monographs and journal articles on IR and Scientometrics. Telephone: +45 32 34 15 00; E-mail: clb798@iva.ku.dk

You have full access to this open access chapter, Download chapter PDF

Introduction

Scientific datasets are becoming increasingly vital to understand as a central component of the modern scientific communication process—Fig. 1. Like for academic publications indexed in traditional citation databases, such as the Web of Science, PubMed or SCOPUS, entire datasets do rarely become deleted from the database or archive. Their original records are rarely edited or erased; but datasets, in particular biodiversity datasets, may indeed be updated and grow in number of records over time or be modified or restructured. This characteristic is associated with the potential for change also observed in many Web-based documents. However, unlike references given in academic publications crediting influence or direct knowledge import from other publications no common standards are available for crediting scientific datasets across the array of disciplines (Green 2009). Thus, none of the aforementioned citation-based systems explicitly take into account scientific datasets as targeted objects for use in academic work.

For biodiversity data a task force was working on this issue in order to generate recommendations for the foundation of a workable citation mechanism (Moritz et al. 2011). In addition, a set of Data Usage Index (DUI) indicators has been developed (Ingwersen and Chavan 2011). The central indicators for the development of a DUI were based on search events and dataset download instances. The DUI is intended also to provide novel insights into how scholars make use of primary biodiversity data in a variety of ways. Similar to scientometric analyses applying rank distributions, time series, impact measures and other calculations based on academic publications (Moed 2005), the social usage of primary biodiversity datasets has led to observations of their statistical characteristics as well as the development of a family of indicators and other derived significant measures. The indicators can be regarded a kind of social utility metrics which, like citations, ratings or recommendations, may be applied as impact measures in research evaluation and form supporting relevance evidence for retrieval purposes (Ingwersen and Järvelin 2005).

Initially, the presentation places the biodiversity dataset indicators within the framework of Informetrics, as a sub-section of scientometric analysis and associated with Webometrics. This is followed by examples of selected rank distribution properties of biodiversity datasets in order to observe if such distributions are similar to those observed for academic journals and articles, i.e. if they follow Bradford-like long-tail distributions. In such power-law-like cases it is expected that information management solutions similar to those used in repository management and libraries can be applied to biodiversity datasets. In addition, one may expect such statistical properties to lead to useful social utility-based research monitoring metrics. A selection of DUI indicators that are useful from this perspective, such asUsage andInterest Impactscores and relative data usage impact, will be highlighted and exemplified. The presentation ends with a brief discussion of consequences of the biodiversity dataset characteristics from the perspectives of dataset management, retrieval and evaluation.

Biodiversity Datasets in the Informetric Framework

The scientific communication system displayed in Fig. 1 (Ingwersen 2011) contains several key components that may serve as fix points for scientometric indicator developments. Foremost they center on official research output, such as conference proceeding papers and journal articles, but also monographic publications, working papers and research reports are relevant in this respect. Patents (not shown on the Figure) signify additional particular kinds of research output, with own databases and indicator systems. With increased accessibility through the Web institutional repository publications as well as a growing body of scientific datasets of various kinds are available to researchers. In particular, datasets are used and re-used in order to carry out many different kinds of analyses, e.g. meta-analyses; benchmarking; bio topographic studies; genomics analyses, etc. Like for publications, datasets can be analyzed for their properties, for instance, with respect to volume of records, objects or topics they index and describe, and properties of authorship. Biodiversity datasets are interesting, because most are available on the Web often in a standardized database setting, but they require a lot of work to establish and this resource is only indirectly credited in the publications actually relying on biodiversity datasets. Thus the development of the set of DUI indicators analyzed below.

By being accessible on the Web one might argue that biodiversity dataset indicators based on social usage (on the web) belong to Webometrics, alternatively to the range of so-called ‘altmetrics’ indicators (Kurtz and Bollen 2010), Fig. 2. Webometric analyses imply quantitative studies of the Web, including usage of web-based resources. ‘Altmetrics’ has recently been proposed as a sub-area of Webometrics fundamentally dealing with the study of usage of social media (on the Web) such as Twitter, Facebook, blogs, and similar social networks. Typically, the actual usage population is fairly unknown in ‘altmetric’ analyses—as in many but not all webometric research areas—implying that the statistical properties are difficult to assess or control. In biodiversity dataset usage this is also the case: who is behind the searching computer is unknown to the online analyst, but the geographical area from which the search is done is known to the biodiversity dataset server. In addition, some properties are well known: the affiliation of the dataset provider; the size of the dataset in question; the topics and objects covered by the dataset.

It is thus fair to state that Informetric analyses of biodiversity datasets belong to Scientometrics, i.e. quantitative analyses of the science system(s), using Bibliometric methods, such as rank distributions, and intersected with Webometrics since the datasets are available through the Web, Fig. 2. Whether to use the notion of ‘altmetrics’ or simply webometrics for the analyses made is an open question, which I as instigator of Webometrics as a study area (Thelwall et al. 2005) will let the community to decide.

Biodiversity Dataset Characteristics

The objectives of the proposed DUI were (Ingwersen and Chavan 2011, p. 2) “[to] make the dataset usage visible, providing deserved recognition of their creators, managers, and publishers and to encourage the biodiversity dataset publishers and users to:

Increase the volume of high quality data discovery, mobilisation and publishing;
Further use of primary biodiversity data in scientific, conservation, and sustainable resources use purposes; and
Improve formal citation behaviour regarding datasets in research.”

In order to do so understanding of the characteristics of the datasets and their behaviour in the scientific life-cycle is central. Biodiversity datasets are presently accessible online through the Global Biodiversity Information Facility (GBIF) located in Copenhagen, Denmark. The structure and prospects of GBIF is outlined by Chavan and Ingwersen (2009). The GBIF data portal was established in 2001 (http://data.gbif.org.) and holds currently over 400 million records published in more than 10,000 datasets by almost 500 data publishers, with the largest data set containing more than 21 million records. The Data Usage Index (DUI) indicator developments were based on data usage logs of the GBIF data portal. The logs provide general usage data on kinds of access and searches via IP addresses as well as download events of datasets within the control of the GBIF data portal. As a spin-off the usage logs also provides different rank distribution characteristics, which are directly accessible online for analysts through the GBIF Portal and its datasets—eventually via known dataset providers.

Table 1 Top-18 distribution of search events and number of records viewed, ranked by Search Events in the Danish Biodiversity Information Facility (DanBIF); (GBIF, December 1–31, 2009). Search density signifies no. of records per search event

Full size table

Table 1 demonstrates the top-rankings of a typical distribution of different datasets produced by the same dataset provider (Biodiversity data: Danish Biodiversity Information Facility, DanBIF, GBIF 2010) at a specific time period, i.e. one month, December 2009. During the selected time slot the provider was searched in total 5,704 times and the users looked at 207,622 records from the 36 available datasets, with an average search density of 36.4 records. Out of this volume the GBIF logs inform that 42,923 records were downloaded through 538 download events (average download density = 79.8 records), not shown on Table 1. Like for journal articles distributed over a publishing journal according to citations, the GBIF mobilized dataset records might be distributed over datasets according to usage (downloads) or searching.

Detailed analyses of the GBIF logs reveal that similar to articles vs. journals a Bradford distribution can be observed for searched biodiversity dataset records dispersed over datasets. A Bradford rank distribution of journals is a Gini-index like distribution of the power law forma; an; an ²—wherea signifies the number of journals publishing the upper tertile of articles (≈ datasets producing the upper tertile of records, searches or downloads) andn a constant specific for that scientific area (Garfield 1979; Moed 2005). Although the number of datasets in the distribution is quite small (36) we may, with good will, observe an approximation to a Bradford distribution for the searched records: the first tertile (69,207 records) of the total number of records (207,622) is covered by the top-2 1/2 datasets alone (sorted by Record Number = 79,273 records). The next 6 1/2 datasets cover 74,086 records, approximating the second tertile. The remaining 27 datasets cover the last tertile. This approximates toa = 2.5 datasets;a n = 2.5 × 3 (= 7 1/2 datasets) anda n ² = 2.5 × 9 (= 22 1/2 datasets). A Bradford distribution for a given range of datasets implies that very few datasets (2–3) cover a large portion (> 33 %) of the entire volume of records in the area covered by the range of datasets (here defined by the provider), followed by a long tail phenomenon.

In fact, the pattern shown is steeper than suggested by a standard Bradford distribution. More than 2/3 of the searched records in the DanBIF biodiversity collection (142,000 records) were covered by only 7 datasets (20 %), Fig. 3, right-hand side. From Table 1 we observe that of the top-10 datasets ranked according to search events (popularity) seven datasets were also those sets with most used records as searched and viewed by peer biodiversity researchers world-wide. The pattern can be monitored over time for consistency, see example Table 2 for the HUA provider. During the monitored month in 2009 the DOF dataset was the most used set according toSearched Records but ranked fifth with respect toSearching Event frequency, i.e. popularity. In addition the DOF dataset had the highest Search Density (105.5 records per search event).

Figure 4 displays the corresponding rank distribution of search events over the 36 DanBIF datasets during the same time slot, again providing a long tail distribution, but with two datasets standing out as most searched (popular) datasets. Cumulated they constitute 38 % of all events taking place during the period (2184 search events of a total of 5704 events), Fig. 4, right-hand side.

Data can be extracted from other elements of the GBIF data portal logs in order to generate rank distributions, e.g. associated with specific species or of frequent visits gaining access into specific dataset providers or datasets, via IP addresses. These latter distributions rank the top players thatimport knowledge in specific areas or from particular providers, datasets or species/taxa. Only the GBIF server staff is able to extract such data whilst the shown distributions are publicly available online. As part of its architecture the GBIF data portal supplies up-to-date lists of datasets as well as of dataset publishers, sorted alphabetically and detailing dataset name, Record Number and an entry to the dataset event log. The lists and structured event logs per dataset and provider can be downloaded easily (Fig. 5) and eventually re-ranked or manipulated statistically offline.

A recent online analysis of the GBIF event log demonstrates that, for instance, the Danish Mycological Society (Row 1, Table 1) at present holds 81,000 records, and during the month December 1–31, 2012 the dataset was searched 250,001 times retrieving and viewing 5,234,732 records with a search density of 20.9 records per event (Biodiversity data: Danish Mycological Society, GBIF 2013). These figures illustrate the dramatic increase of the usage of the GBIF portal over a period of one month during three years, see Table 1 for comparison.

Dataset Usage Index Indicators

In (Ingwersen and Chavan 2011) the range of DUI indicators is defined, exemplified and discussed. They are based on the extracts of data from the GBIF event logs for datasets and are constructed according to common scientometric standards for research evaluation indicators (Moed 2005). Below we point to the most prominent indicators and discuss briefly their potentials, since they are characteristic for biodiversity datasets that are publicly available, searched and downloaded. Table 2 demonstrates 13 of the indicators, exemplified by dataset properties from three different dataset providers: The large network-like US-based Ocean Biogeographic Information System (OBIS) and the two Danish providers DanBIF and Herbarium of University of Aarhus (HUA).

Table 2 Dataset indicator examples. Record No. as per December 31, 2009

Full size table

The Number of Datasets produced by a publisher (N(u)) at a given point in time may characterize the publisher into small (N < 10), Medium (10 < N < 100), Large (100 < N < 300) and ultra-large (N > 300). The reason behind this classification is that it is meaningless to compare between providers of quite different sizes. Like for citation impact small (often specialized) universities should not be compared to large universal universities. DanBIF is thus regarded a medium-size provider while HUA is seen as a small dataset producer. Table 2 provides an overall view of their characteristics (Ingwersen and Chavan 2011, p. 7).

For the large-scale US provider OBIS the analysis window is one month against 6 month for the two other providers. Comparisons should hence not be carried out them in between. TheUsage Ratio signifies the number of records downloaded over number of records searched during the same period. The higher the ratio the more searched records are also subsequently downloaded and imply a kind of social attractiveness of the datasets in question.

The table shows that regardless of length of analysis window the numbers ofSearched Records andDownload Frequency were quite substantial in 2009, supporting the conception of a DUI.Download Events were very low compared to the number ofSearch Events across all three publishers and periods. Three years later the GBIF portal seems well established in the mind of the global research community, Fig. 5, with aDownload Eventscore during the six-month period February–August 2013 for HUA raised to 4,226 events and with aDownload Density reaching 196.6 against 15,205Search Eventswith a corresponding much lowerDensity of 17.6 (Biodiversity data: Herbarium of University of Aarhus 2013).

TheUsage Balance betweenDownload andSearch Events was quite low in 2009: only approx. 1–2 % of the search events lead to direct downloading for the providers; for HUA less than 1 %. In 2013, for one HUA dataset, theUsage Balance reaches 28 %, implying that for each 4 search events there is one pure download event taking place, signifying that searchers seem more familiar with the dataset contents and do not require constantly to search and investigate the set prior to actual usage. This coincides with theUsage Ratio, or social attractiveness score, which for HUA during the six month in 2013 reaches 3.1 signifying that more than three times the searched records are actually downloaded.

According to Ingwersen and Chavan (2011, p. 7) theInterest andUsage Impact factors inform about the average number of times each record stored by a dataset publisher has been searched or actively downloaded. In both metrics a value greater than 1.0 implies that in principle all the dataset records on average have been searched or downloaded at least once during the analysis period. The two time slots, Table 2 (2009a, b), may illustrate the developments for a dataset provider like HUA during the entire year 2009, i.e., showing a slight decrease inUsage Impact (from 3.1 to 2.8) and a strong increase inInterest Impact (from 8.9 to 28.3). In contrast, during the recent six-month period in 2013 HUA’sUsage andInterest Impact values are 7.4 and 2.4 respectively^{Footnote 1}. TheUsage Impact has increased substantially while theInterest Impact has noticeably dropped. This is due to a strong increase in downloads and much less searching and viewing activity during the later period, in accordance with theUsage Balance andRatio scores.

Aside from the DUI indicators, Table 2, the event logs may in addition produce data on the most popular objects, i.e., the species in the dataset that are most searched and viewed during the selected analysis period. Such data constitutes theusage profile for a particular dataset and changes can be monitored over time.

These absolute DUI metrics can be turned into relative indicators, e.g. by relating single datasets to their provider’s cumulated properties or associating several providers to the national aggregation for particular indicators. The HUAUsage Impact Factor for 2009b relative to Denmark (U-IF/DK) is thus 2.77/0.32 = 8.65. The corresponding U-IF (DanBIF) is 0.53. Examples of relative DUI indicators and all formulas are shown in (Ingwersen and Chavan 2011).

Concluding Remarks

The presentation demonstrates the feasibility of establishing a framework for academic crediting of dataset production, searching and usage. The Dataset Usage Index signifies a step forward towards such a dataset management framework. The reason that the DUI is appropriate lies in the rank distribution properties which, among other characteristics, follow the pattern of power laws in proximity of Bradford distributions. Further, the distributions make it feasible to point to the most popular or socially attractive datasets, providers or species, monitored over time, and to apply such evidence in dataset management decisions as well as for retrieval purposes. The latter perspective reaches into types of recommendation systems commonly applied to other kinds of social media (Bogers and van den Bosch 2011). Because of their usage dimension biodiversity datasets, as well as other scientific datasets may be seen as particular kinds of cooperative filtering information systems.

In addition, a range of absolute as well as relative usage indicators has been defined and exemplified. Biodiversity datasets and their records seem to display some similar characteristics as journals and articles published in such journals. It is thus very likely that information management traits that have been found appropriate for academic journals and journal articles in repositories and libraries are equally useful for biodiversity and other scientific datasets. Similarly, a DUI is likely to serve as a convenient complement to traditional citation-based research monitoring, in particular with respect to institutional evaluations since the biodiversity datasets constitute a substantial workload otherwise not made visible in traditional research monitoring schemes.

Notes

1.
The number of records in the one HUA dataset available in August 2013 is 111,525.

References

Biodiversity occurrence data published by: Danish Biodiversity Information Facility (Accessed through GBIF Data Portal, data.gbif.org, 2010-01-01)
Google Scholar
Biodiversity occurrence data published by: Danish Mycological Society (Accessed through GBIF Portal, data.gbif.org, 2013-08-16)
Google Scholar
Biodiversity occurrence data published by: Herbarium of University of Aarhus (Accessed through GBIF Portal, data.gbif.org, 2010-01-01; 2013-08-16)
Google Scholar
Biodiversity occurrence data published by: Ocean Biogeographic Information System (OBIS) (Accessed through GBIF Portal, data.gbif.org, 2010-01-01)
Google Scholar
Björneborn L, Ingwersen P (2004) Toward a basic framework for webometrics. J Am Soc Inf Sci Technol 55(14):1216–1227
Article Google Scholar
Bogers T, van den Bosch A (2011) Fusing recommendations for social bookmarking websites. Int J Electron Commer 15(3):31–72
Article Google Scholar
Chavan VS, Ingwersen P (2009) Towards a data publishing framework for primary biodiversity data: challenges and potentials for the biodiversity informatics community. BMC Bioinformatics 10(Suppl 14):11 s
Google Scholar
Garfield E (1979) Bradford’s law and related statistical patterns. current comments; essays of an information scientist 1979–1980, May 7 1979. pp 5–11
Google Scholar
GBIF.http://data.gbif.org/
Green T (2009) We need publishing standards for datasets and data tables. White paper OECD Publishing; 2009, 9–11. doi:10.1787/603233448430
Google Scholar
Ingwersen P, Chavan V (2011) Indicators for the Data Usage Index (DUI): an incentive for publishing primary biodiversity data through global information infrastructure. 15 Dec. 2011. BMC Bioinformatics 12(Suppl 15):S3, s. S3. 10 s
Google Scholar
Ingwersen P, Järvelin K (2005) The turn: integration of information seeking and retrieval in context. Springer
Google Scholar
Kurtz M, Bollen J (2010) Usage bibliometrics. Ann Rev Inf Sci Technol 44:3–64
Google Scholar
Moed HF (2005) Citation analysis in research evaluation. Springer, Dordrecht
Google Scholar
Moritz T, Krishnan S, Roberts D, Ingwersen P, Agosti D, Penev Y, Cockerill M, Chavan V (2011) Towards mainstreaming of biodiversity data publishing: recommendations of the GBIF Data Publishing Framework Task Group. 15 Dec. 2011. BMC Bioinformatics 12(Suppl 15):S1, 10 s
Google Scholar
Thelwall M, Vaughan L, Bjorneborn L (2005) Webometrics. Ann Rev Inf Sci Technol 39:81–135
Article Google Scholar

Download references

Author information

Authors and Affiliations

Royal School of Library and Information Science, University of Copenhagen, Birketinget 6, 2300, Copenhagen, DK, Denmark
Peter Ingwersen

Authors

Peter Ingwersen
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Peter Ingwersen .

Editor information

Editors and Affiliations

School of Information Management, Wuhan University, Wuhan, China
Chuanfu Chen
School of Information Sciences, University of Pittsburgh, Pittsburgh, Pennsylvania, USA
Ronald Larsen

Rights and permissions

This chapter is published under an open access license. Please check the 'Copyright Information' section either on this page or in the PDF for details of this license and what re-use is permitted. If your intended use exceeds what is permitted by the license or if you are unable to locate the licence and re-use information, please contact the Rights and Permissions team.

Copyright information

About this chapter

Cite this chapter

Ingwersen, P. (2014). Scientific Datasets: Informetric Characteristics and Social Utility Metrics for Biodiversity Data Sources. In: Chen, C., Larsen, R. (eds) Library and Information Sciences. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-54812-3_8

Download citation

DOI: https://doi.org/10.1007/978-3-642-54812-3_8
Published: 01 October 2014
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-54811-6
Online ISBN: 978-3-642-54812-3
eBook Packages: Business and EconomicsBusiness and Management (R0)

Publish with us

Policies and ethics