Does It Fit? KOS Evaluation Using the ICE-Map Visualization

Eckert, Kai; Ritze, Dominique; Pfeffer, Magnus

doi:10.1007/978-3-662-46641-4_36

Conference paper
First Online: 01 January 2015

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7540)

Abstract

The ICE-Map Visualization was developed to graphically analyze the distribution of indexing results within a given Knowledge Organization System (KOS) hierarchy and allows the user to explore the document sets and the KOSs at the same time. In this paper, we demonstrate the use of the ICE-Map Visualization in combination with a simple automatic indexer to visualize the semantic overlap between a KOS and a set of documents.

1 Introduction

Hierarchical Knowledge Organization Systems (KOSs), such as thesauri, taxonomies, or other kinds of (lightweight) ontologies, are widely used to describe all kinds of resources, among them large document corpora. In the Semantic Web, these KOSs are usually described in SKOS (Simple Knowledge Organization System^{Footnote 1}). The public availability of diverse KOSs on the web opens up new possibilities for reusing existing KOSs, but at the same time raises the question of which KOS is suitable for the resources to be described. Thus, measuring how well the subject coverage of a given document set overlaps with each of several candidate KOSs is a necessary task before starting further, possibly time-consuming and costly, efforts to annotate the documents. For these measurements, a one-dimensional analysis along the lines of “summing the number of concepts that appear in the documents” is not sufficient. First, there is no baseline to compare the resulting numbers with; second, all hierarchical information is lost, so the results cannot be compared in their subject context. Instead, we propose to combine a statistical measure with a graphical visualization that preserves the hierarchical context. The numerical results provided by the measure are intuitive to understand and well suited for a graphical representation. Both are combined in the ICE-Map Visualization.

In this paper, we use the ICE-Map Visualization [2] to visualize the overlaps between KOSs and document sets. This visualization is based on a treemap and allows the user to browse a KOS hierarchy interactively. The colors indicate which parts of the KOS fit the documents. To create the visualization, the ICE-Map Visualization requires that the documents are annotated with KOS concepts. In the use case discussed here, there are no annotations yet and manual assignment is not feasible. Thus, the annotations have to be generated automatically. We show that the ICE-Map Visualization in combination with our automatic indexing approach is suitable to calculate and visualize the overlap between a KOS and a document set in such a way that users can make informed decisions on whether the document set fits the KOS.

2 Setup

We apply a KOS-based indexing approach to determine which concepts of the KOS occur in a given document. For this purpose, we developed a purely linguistic indexer called LOHAI [1], which is free and open source. It uses part-of-speech tagging, stemming, and word-sense disambiguation. It is especially important that the indexer does not rely on any additional knowledge sources and is kept simple, so that it remains usable and its results remain comprehensible. The reference implementation is available online^{Footnote 2}. The weighted concept annotations created by LOHAI form the basis for the ICE-Map Visualization.

The ICE-Map Visualization is an approach for visual data mining (VDM) specifically designed for the maintenance and use of concept hierarchies in various settings. In this paper, we use it to visualize the number of documents associated with the concepts in the thesaurus. The ICE-Map Visualization is described in detail by Eckert [2]. Here, we briefly recapitulate the basic idea and introduce the weight function employed in this paper.

The usage of a concept $c$ is determined by a weight function $w(c) \in \mathbb {R}^+_0$ that assigns a non-negative, real weight to it. Based on this weight function, we further define:

$$\begin{aligned} w^+(c)=w(c)+\sum _{c' \in \text {Children}(c)}w^+(c') \end{aligned}$$

(1)

with $\text {Children}(c)$ being the direct child concepts (narrower concepts) of $c$. $w^+(c)$ is a monotonic function on the partial order of the concept hierarchy $H$, i.e., the value never increases while walking down the hierarchy. This gives the value of the root node $\text {root}(c)$ a special role as the maximum value^{Footnote 3} of $w^+$, which we denote as $ \hat{w}^+$: $\hat{w}^+(c)=w^+(\text {root}(c)) = \max _H w^+(c)$.
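The recursion in Eq. 1 can be illustrated with a minimal Python sketch. The dictionary-based representation of the hierarchy, the function name `cumulative_weights`, and the toy concepts are assumptions made for illustration; they are not taken from the SEMTINEL implementation.

```python
from typing import Dict, List


def cumulative_weights(children: Dict[str, List[str]],
                       w: Dict[str, float],
                       root: str) -> Dict[str, float]:
    """Return w+(c) for every concept reachable from root (Eq. 1)."""
    w_plus: Dict[str, float] = {}

    def visit(c: str) -> float:
        # Memoize so that each concept is computed only once.
        if c not in w_plus:
            own = w.get(c, 0.0)  # concepts without annotations have weight 0
            w_plus[c] = own + sum(visit(ch) for ch in children.get(c, []))
        return w_plus[c]

    visit(root)
    return w_plus


# Hypothetical toy hierarchy and weights, for illustration only.
children = {"root": ["economy", "sociology"], "economy": ["trade", "labour"]}
w = {"economy": 2.0, "trade": 3.0, "labour": 1.0, "sociology": 4.0}
w_plus = cumulative_weights(children, w, "root")
w_hat = w_plus["root"]  # the maximum value of w+ in the hierarchy
```

Because all weights are non-negative, the value at the root is the largest one, which is the property used in the following definitions.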

If we use the number of annotations made for a given concept as the weight function $w(c)$, we can calculate the likelihood that a concept is assigned to a random document as follows^{Footnote 4}:

$$\begin{aligned} L(c)=\frac{w^+(c)+1}{\hat{w}^+(c)+1} \qquad \qquad L(c) \in (0,1] \end{aligned}$$

(2)

In information theory, the information content or self-information of an event $x$ is defined as $-\log L(x)$, i.e., a higher information content means a more unlikely event. Together with a normalizing factor, we get the following definition for the information content $IC(c) \in [0,1]$ of a concept $c$:

$$\begin{aligned} IC(c)= \frac{- \log L(c)}{ \log (\hat{w}^+(c)+1)}\qquad \qquad \hat{w}^+(c)\ne 0 \end{aligned}$$

(3)
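As a hedged illustration of Eqs. 2 and 3, the following Python sketch computes both values from a given $w^+(c)$ and $\hat{w}^+$. The function names are our own, and the information content is written in the algebraically equivalent form $1 - \log (w^+(c)+1) / \log (\hat{w}^+ +1)$ to avoid a negative zero at the root.

```python
import math


def likelihood(w_plus_c: float, w_hat: float) -> float:
    """Eq. 2: likelihood that the concept is assigned to a random document."""
    return (w_plus_c + 1.0) / (w_hat + 1.0)


def information_content(w_plus_c: float, w_hat: float) -> float:
    """Eq. 3, rewritten as 1 - log(w+(c) + 1) / log(w_hat + 1)."""
    if w_hat <= 0:
        # Eq. 3 requires a non-zero root weight (log(1) = 0 in the denominator).
        raise ValueError("information content is undefined for w_hat = 0")
    return 1.0 - math.log(w_plus_c + 1.0) / math.log(w_hat + 1.0)


# Example values: a root with cumulative weight 10 and an unused concept.
ic_root = information_content(10.0, 10.0)
ic_unused = information_content(0.0, 10.0)
```

With these definitions, the root receives an information content of 0 and a concept whose cumulative weight is 0 receives 1.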

$IC(c)$ is again a monotonic function on the partial order of $H$ and assigns $0$ to the root concept and $1$ to concepts with $w^+(c)=0$, i.e., concepts that are neither used themselves nor have any used narrower concepts. The ICE-Map Visualization always visualizes the difference of two information contents based on two different weight functions or two different data sets: $D(c)=IC_1(c)-IC_2(c)$. The power of the ICE-Map Visualization lies in the possibility to choose arbitrary weight functions for $IC_1$ and $IC_2$. To calculate the weight of a concept regarding its usage in a document set, we use:

$$\begin{aligned} w_{1}(c) = \sum _{a \in \text {Aset}(c)} \text {Weight}(a)\end{aligned}$$

(4)

with $\text {Weight}(a)$ denoting the weight of a single annotation $a$ as calculated by LOHAI^{Footnote 5} and $\text {Aset}(c)$ being the set of annotations assigned to a concept $c$.
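Eq. 4 amounts to a grouped sum over the annotations. A minimal sketch, assuming that each annotation is available as a `(document_id, concept_id, weight)` tuple; this tuple layout and the example values are hypothetical and do not reflect LOHAI's actual output format.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Assumed annotation layout: (document_id, concept_id, weight).
Annotation = Tuple[str, str, float]


def usage_weights(annotations: Iterable[Annotation]) -> Dict[str, float]:
    """Eq. 4: sum the weights of all annotations assigned to each concept."""
    w1: Dict[str, float] = defaultdict(float)
    for _doc, concept, weight in annotations:
        w1[concept] += weight
    return dict(w1)


# Hypothetical annotations, for illustration only.
annotations = [
    ("d1", "trade", 0.5),
    ("d2", "trade", 0.25),
    ("d2", "labour", 0.75),
]
w1 = usage_weights(annotations)
```

Concepts that never occur in an annotation are simply absent from the result and are treated as having weight 0.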

To evaluate the suitability, we compare the information content based on Eq. 4 to the intrinsic information content [3] – a heuristic for the expected information content of a concept based on its position in the hierarchy. In our statistical framework, we obtain the intrinsic information content by employing the following weight function:

$$\begin{aligned} w_{2}(c) = \left| \text {Children}(c)\right| \end{aligned}$$

(5)
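To see why Eq. 5 leads to a purely structural information content, it may help to expand the definitions. The following short derivation assumes (our assumption, not stated above) that $H$ is a tree with $N$ concepts; $\text {Desc}(c)$, the set of all direct and indirect narrower concepts of $c$, is a symbol introduced here only for illustration.

$$\begin{aligned} w_2^+(c)&=\left| \text {Children}(c)\right| +\sum _{c' \in \text {Children}(c)}w_2^+(c') = \left| \text {Desc}(c)\right| , \qquad \hat{w}_2^+ = w_2^+(\text {root}) = N-1 \\ L_2(c)&=\frac{\left| \text {Desc}(c)\right| +1}{N} \\ IC_2(c)&=\frac{-\log L_2(c)}{\log N} = 1-\frac{\log \left( \left| \text {Desc}(c)\right| +1\right) }{\log N} \end{aligned}$$

Under this assumption, a leaf concept gets $IC_2(c)=1$ and the root gets $IC_2(c)=0$, so the value depends only on how many narrower concepts lie below $c$, which is the idea behind the heuristic in [3]. In a polyhierarchy, a concept with several broader concepts would be counted once per path, so the identity with $\left| \text {Desc}(c)\right| $ holds only for trees.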

The ICE-Map Visualization uses a treemap to visualize the concept hierarchy together with the results of the analysis. It gives a broad overview of the whole document set with the annotated concepts and supports zooming and navigating the hierarchy of the KOS to get a detailed view. The automatic indexer LOHAI and the ICE-Map Visualization are included in our KOS analysis software SEMTINEL^{Footnote 6}.
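How the difference $D(c)=IC_1(c)-IC_2(c)$ could be turned into a treemap color can be sketched as follows. The linear blue-white-red scale, the function names, and in particular the orientation (red for $D(c)<0$, i.e., a concept that is used more than the reference heuristic expects) are our assumptions for illustration; the actual color mapping in SEMTINEL may differ.

```python
from typing import Tuple


def difference(ic_usage: float, ic_reference: float) -> float:
    """D(c) = IC_1(c) - IC_2(c); both inputs lie in [0, 1]."""
    return ic_usage - ic_reference


def diverging_color(d: float) -> Tuple[int, int, int]:
    """Map D(c) in [-1, 1] to an RGB triple on a blue-white-red scale.

    Assumed orientation: D = -1 is full red (used far more than expected),
    D = 0 is white, and D = +1 is full blue (used far less than expected).
    """
    d = max(-1.0, min(1.0, d))  # clamp values outside the valid range
    fade = round(255 * (1.0 - abs(d)))  # 255 at D = 0, 0 at |D| = 1
    if d < 0:
        return (255, fade, fade)  # shades of red
    return (fade, fade, 255)  # shades of blue (white at D = 0)


# A concept with low usage-based IC but high reference IC is used more
# than expected and would therefore be drawn in a shade of red.
d = difference(0.25, 0.75)
color = diverging_color(d)
```

A diverging scale with white at zero keeps concepts whose usage matches the reference visually neutral, so that only deviations in either direction stand out.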

3 Experiments

To demonstrate that the ICE-Map Visualization together with LOHAI can be used to measure how well a thesaurus suits a document collection, comprehensive document sets and KOSs are needed. The KOSs need to have a significant overlap without describing the same topic, and we also need at least one document set for each KOS that can be assumed to fit that KOS. Furthermore, we prefer well-established KOSs that are freely available and widely used. They need to have a significant size and at least one language in common.

For the experiments, we chose TheSoz^{Footnote 7} (Thesaurus for the Social Sciences) version 0.86 and STW^{Footnote 8} (Standard Thesaurus Wirtschaft) version 8.08. Both KOSs are available as SKOS vocabularies and have a comparable size of about 7000 concepts with English labels. While TheSoz covers all social science disciplines, STW focuses on economic topics. As document sets, we use SSOAR^{Footnote 9} and EconStor^{Footnote 10}. Both are open-access servers, maintained by GESIS and ZBW, respectively, the organisations that also publish the KOSs. From each set, we take a random subset of 2700 documents to ensure comparable results. Like the KOSs, SSOAR focuses on social science and EconStor on economics. Despite some deviations, we can assume that SSOAR naturally fits TheSoz and EconStor fits STW.

In Fig. 1, we show the resulting visualization for all combinations of KOSs and document sets. The coloring represents the weight of a concept in the document set compared to the reference weight determined by the heuristic. It ranges from blue, which means that the weight for this concept is very low, over white to red, which indicates a very high weight. The economic bias of EconStor can clearly be seen in Fig. 1a, since most concepts that are used in the documents are narrower concepts of Economy ①. In contrast, the results of SSOAR/TheSoz (Fig. 1b) do not show such a clear focus on one specific field. Interestingly, Economy is still very visible, an indicator that both disciplines indeed have an overlap that is reflected in the document sets. Moreover, the General Terms section ③ is used similarly by both document sets. When the STW is used as KOS, Fig. 1c shows that EconStor documents contain concepts from several parts (especially Economy ①), while SSOAR documents use concepts that are narrower concepts of Related Subject Areas and especially of Sociology (Fig. 1d, ②). Other parts that are used well by both document sets are again general parts like Geographical Terms ③ and General Terms ④. All in all, the semantic overlaps of the document sets with the KOSs are clearly visible. Without any further information, we evaluated two document sets and two KOSs and were able to develop a deeper understanding of them just by browsing through the ICE-Map Visualization.

4 Conclusion

We presented an approach to visualize the semantic overlap of a KOS and a document set. We combined the ICE-Map Visualization with a very simple automatic indexer called LOHAI. We chose two KOSs and two document sets with a significant topical overlap to demonstrate the usefulness of our approach. Based on the resulting visualization, we could show that it is possible to identify whether a KOS and a document set fit together topically. Thus, choosing a suitable KOS or maintaining a KOS that is already in use becomes considerably easier.

Notes

1. http://www.w3.org/TR/skos-reference/.
2. https://github.com/kaiec/LOHAI.
3. The root node is defined as the only concept $c$ in $H$ for which holds that $\text {Parents}(c)=\emptyset $. Note that we require $H$ to have a single root concept. Otherwise, we introduce an artificial single root concept that becomes the parent of all former root concepts.
4. The addition of $1$ is necessary to allow a value of $0$ for $w(c)$. Otherwise, the logarithm of $L(c)$ (cf. Eq. 3) would not be defined for $w(c)=0$.
5. Strictly speaking, from an information-theoretic perspective, this function interprets the tf-idf weight of the annotation as the likelihood of being an annotation for the document. This interpretation is not correct, as tf-idf is not a probability value.
6. http://www.semtinel.org/.
7. http://lod.gesis.org/thesoz/.
8. http://zbw.eu/stw/versions/latest/download/about.en.html.
9. http://www.ssoar.info/.
10. http://www.econstor.eu/.

References

1. Eckert, K.: LOHAI: Providing a baseline for KOS based automatic indexing. In: Proceedings of the First International Workshop on Semantic Digital Archives (SDA) at the International Conference on Theory and Practice of Digital Libraries (TPDL) 2011, Berlin, 29 September 2011
2. Eckert, K.: The ICE-Map Visualization. Technical Report TR-2011-003, University of Mannheim, Department of Computer Science (2011)
3. Seco, N., Veale, T., Hayes, J.: An intrinsic information content metric for semantic similarity in wordnet. In: Proceedings of the 16th European Conference on Artificial Intelligence, Valencia, Spain, pp. 1089–1090 (2004)

Author information

Authors and Affiliations

University of Mannheim, University Library, Mannheim, Germany
Kai Eckert & Dominique Ritze
Stuttgart Media University, Stuttgart, Germany
Magnus Pfeffer

Corresponding author

Correspondence to Kai Eckert.

Editor information

Editors and Affiliations

University of Southampton, Southampton, United Kingdom
Elena Simperl
British Museum, London, United Kingdom
Barry Norton
Ljubljana, Slovenia
Dunja Mladenic
DEIB - Politecnico di Milano, Milano, Italy
Emanuele Della Valle
Foundation for Research and Technology Hellas (FORTH), Heraklion, Greece
Irini Fundulaki
MDG Web Limited, Dublin, Ireland
Alexandre Passant
Multimedia Communications Department, EURECOM, Campus SophiaTech, Biot, France
Raphaël Troncy

About this paper

Cite this paper

Eckert, K., Ritze, D., Pfeffer, M. (2015). Does It Fit? KOS Evaluation Using the ICE-Map Visualization. In: Simperl, E., et al. The Semantic Web: ESWC 2012 Satellite Events. ESWC 2012. Lecture Notes in Computer Science, vol 7540. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-46641-4_36

DOI: https://doi.org/10.1007/978-3-662-46641-4_36
Published: 21 April 2015
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-46640-7
Online ISBN: 978-3-662-46641-4
eBook Packages: Computer Science (R0)
