Intra-list similarity and human diversity perceptions of recommendations: the details matter

Jesse, Mathias; Bauer, Christine; Jannach, Dietmar

doi:10.1007/s11257-022-09351-w

Open access
Published: 12 December 2022

Volume 33, pages 769–802, (2023)

User Modeling and User-Adapted Interaction

Mathias Jesse¹,
Christine Bauer² &
Dietmar Jannach¹

Abstract

The diversity of the generated item suggestions can be an important quality factor of a recommender system. In offline experiments, diversity is commonly assessed with the help of the intra-list similarity (ILS) measure, which is defined as the average pairwise similarity of the items in a list. The similarity of each pair of items is often determined based on domain-specific meta-data, e.g., movie genres. While this approach is common in the literature, it remains open in most cases whether a particular implementation of the ILS measure is actually a valid proxy for the human diversity perception in a given application. With this work, we address this research gap and investigate the correlation of different ILS implementations with human perceptions in the domains of movie and recipe recommendation. We conducted several user studies involving over 500 participants. Our results indicate that the particularities of the ILS metric implementation matter. While we found that the ILS metric can be a good proxy for human perceptions, it turns out that it is important to individually validate the ILS metric implementation used for a given application. On a more general level, our work points to a certain level of oversimplification in recommender systems research when it comes to the design of computational proxies for human quality perceptions and thus calls for more research regarding the validation of the corresponding metrics.

1 Introduction

The main task of a recommender system is to surface items that are relevant for users in their current context. However, it is well known that in many cases, being accurate in terms of predicting which items are relevant may not be enough (McNee et al. 2006). One established additional quality criterion that is important in many application domains is that of diversity. Non-diverse recommendation lists may not only appear monotone to users, but they may also lead to limited discovery if they only cover a limited part of the available catalog, e.g., only movies from the most preferred genre. Therefore, various algorithmic approaches were proposed over the years to ensure a certain level of diversity in the recommendations, see Kaminskas and Bridge (2016) and Kunaver and Požrl (2017) for related surveys.

As diversity may be considered a subjective concept, several user studies focused on understanding in which ways the perceived diversity of a set of recommendations may affect other quality factors such as perceived accuracy (Pu et al. 2011; Ekstrand et al. 2014; Willemsen et al. 2016; Nilashi et al. 2016). However, probably a much larger number of publications on diversity use offline experiments as a research methodology and therefore rely on objective, computational metrics to quantify the extent of diversity of a given recommendation list. A very common approach to quantify a list’s diversity is to consider the pairwise similarities of the items. Early proposals were made over 20 years ago in the context of retrieval-based and conversational recommenders (Bradley and Smyth 2001; McGinty and Smyth 2003). Soon after, Ziegler et al. (2005) popularized this approach under the term intra-list similarity (ILS) in their early work on topic diversification in a book recommendation setting.

Technically, the ILS of a set P of (recommended) items is defined in Ziegler et al. (2005) as follows:

$$\begin{aligned} \hbox {ILS}(P) = \frac{\sum _{p_i \in P} \sum _{p_j \in P,\, p_i \ne p_j} sim(p_i, p_j)}{2} \end{aligned} \tag{1}$$

where sim is an arbitrary function that returns a similarity score for two items. Often, the score is standardized to lie in $[-1,1]$ or [0, 1]. Note that in Ziegler et al. (2005) the sum of the pairwise similarities is divided by two. Reporting the average pairwise similarity, as done earlier in Bradley and Smyth (2001), is however more common today, and the denominator in this case would be the number of comparisons, i.e., $(|P|(|P|-1))/2$.
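The difference between the two normalizations can be illustrated with a minimal sketch. The function names `ils_sum` and `ils_mean`, the Jaccard similarity on genre sets, and the toy movie data are assumptions made for illustration only; they are not taken from Ziegler et al. (2005) or from our study material.

```python
from itertools import combinations, permutations


def jaccard(a, b):
    """Toy similarity on genre sets, in [0, 1] (an illustrative choice)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def ils_sum(items, sim):
    """Eq. (1): sum of sim over all ordered pairs (i != j), divided by two."""
    return sum(sim(a, b) for a, b in permutations(items, 2)) / 2


def ils_mean(items, sim):
    """Average pairwise similarity over the |P|(|P|-1)/2 unordered pairs."""
    pairs = list(combinations(items, 2))
    if not pairs:
        raise ValueError("need at least two items")
    return sum(sim(a, b) for a, b in pairs) / len(pairs)


# Hypothetical three-item list, each item described by its genre set.
movies = [
    frozenset({"action", "crime"}),
    frozenset({"action"}),
    frozenset({"comedy"}),
]

print(ils_sum(movies, jaccard))   # grows with the list length
print(ils_mean(movies, jaccard))  # comparable across list lengths
```

For a symmetric similarity function, the two values differ only by the factor $(|P|(|P|-1))/2$, so they rank lists of equal length identically; the averaged variant is additionally comparable across lists of different lengths.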

One important aspect of the technical formulation of the ILS metric is that the order of the elements in a list does not matter. The same ILS value will be returned when all similar items are dispersed across the list or when they are clustered, e.g., at the beginning or end of the list.^{Footnote 1} Another feature of the ILS metric definition is that it is generic in the sense that it is not tailored to a particular application setting. Depending on a specific application, any suitable similarity function can be plugged in. Ziegler et al. (2005) used Amazon’s book taxonomy to diversify the results of their recommender. Later works relied on various other types of (meta-)data, for example, movie genres (Vargas et al. 2012), food ingredients (Hauptmann et al. 2021), artist similarity based on social tags (Jannach et al. 2017), or latent topic models from the users’ interactions (Shi et al. 2012).
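Both properties, order invariance and the pluggable similarity function, can be made concrete with a small sketch. The item dictionaries, their `genres` and `vec` fields, and the two similarity functions below are hypothetical and serve only as an illustration.

```python
import math
from itertools import combinations


def mean_ils(items, sim):
    """Average pairwise similarity; the item order never enters the result."""
    pairs = list(combinations(items, 2))
    return sum(sim(a, b) for a, b in pairs) / len(pairs)


def genre_sim(a, b):
    """Jaccard overlap of genre sets (meta-data based, illustrative)."""
    union = a["genres"] | b["genres"]
    return len(a["genres"] & b["genres"]) / len(union) if union else 0.0


def latent_sim(a, b):
    """Cosine similarity of latent vectors (illustrative)."""
    dot = sum(x * y for x, y in zip(a["vec"], b["vec"]))
    norm = math.hypot(*a["vec"]) * math.hypot(*b["vec"])
    return dot / norm if norm else 0.0


# Hypothetical items: two action titles followed by two comedies.
items = [
    {"genres": {"action", "crime"}, "vec": (0.9, 0.1)},
    {"genres": {"action"}, "vec": (0.8, 0.3)},
    {"genres": {"comedy"}, "vec": (0.1, 0.9)},
    {"genres": {"comedy", "romance"}, "vec": (0.2, 0.8)},
]
clustered = items                                       # similar items adjacent
dispersed = [items[0], items[2], items[1], items[3]]    # similar items apart

for sim in (genre_sim, latent_sim):
    print(sim.__name__, mean_ils(clustered, sim), mean_ils(dispersed, sim))
```

The two orderings yield the same value for each similarity function, whereas the two similarity functions generally assign different values to the same list.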

In many works on diversity in recommendation, the rationale for using a particular similarity function is however not discussed in depth, and it might simply be based on the availability of item meta-data. In this regard, we may therefore face an oversimplification in terms of the selection and operationalization of diversity metrics. Most works come without an evaluation of the selected diversity metric—neither for a specific application nor across settings or domains. Instead, it is simply assumed—often without evidence or theoretical underpinning—that the chosen computational metric is aligned with human perceptions. Moreover, while many published works may show that a particular diversity-aware algorithm has an effect on the chosen ILS metric, it is often unclear if the algorithm would actually impact the users’ diversity perception. Understanding the users’ perception is however important because it may significantly affect the quality perception of the recommender system and the behavioral intentions of users, as mentioned above (Pu et al. 2011; Ekstrand et al. 2014; Willemsen et al. 2016; Nilashi et al. 2016).

In this work, we address this largely open research gap and investigate to what extent different ILS metric implementations are suitable proxies for the diversity perception of users. For that purpose, we conducted a number of user studies in two application domains, involving over 500 participants. In these studies, we presented the participants with recommendation lists that had different diversity levels according to a particular ILS metric, and we then contrasted the ILS-based diversity values with the participants’ self-reported diversity perceptions. Our studies led to two main insights. First, we find that ILS can be a valid proxy for user-perceived diversity. In both application domains, we found a metric implementation that correlated well with user perceptions. Second, however, we found that the particularities matter: using different metric implementations for diversification results in varying diversity perceptions, and what works well in one domain does not necessarily work well in another. Overall, our work therefore confirms our conjecture regarding a certain level of oversimplification in our research practices when it comes to studying diversity-aware recommendation algorithms. Correspondingly, more research seems to be needed so that future research in this area can build on validated metrics.

The remainder of the paper is organized as follows. After discussing previous works in Sect. 2, we describe the details of our user studies in Sect. 3. The results are presented and discussed in Sect. 4. We conclude our work with a summary of the main findings and an outlook on future research in Sect. 5.

2 Related work

The main concepts of interest in our study are diversity metrics for recommendation lists and the diversity perceptions of humans. In terms of terminology, let us note upfront that “similarity” and “diversity” are often considered to be inversely related concepts in the literature and that the terms are sometimes used in an almost interchangeable manner. Technically, as mentioned above, the diversity of a list is commonly computed as a mathematical inverse of a metric that is based on similarity. In user-centric evaluations, by comparison, both questions related to the perceived similarity and the perceived diversity are sometimes used to assess the diversity of a list, e.g., in Pu et al. (2011).

In the following sections, we first present related work on intra-list similarity (ILS), which is the central metric in our work, and discuss other similarity metrics used in recommender systems research (Sect. 2.1). Thereafter, we review previous works that studied human perceptions of similarity and diversity. First, we discuss works that focus on the similarity perception of item pairs (Sect. 2.2); subsequently, we present related work that focuses on the diversity perception of lists of items (Sect. 2.3).

Overall, our present work is different from the discussed prior works in that we aim to validate whether commonly used metrics to assess the diversity of lists of items are indeed valid proxies of human perceptions. For that purpose, we investigate to what extent the particularities of the specific implementation of ILS matter for two important application domains.

2.1 Intra-list similarity and other diversity metrics

ILS is probably the most commonly used metric to capture diversity in recommender systems in the literature (Du et al. 2021). As described earlier, ILS is based on pairwise similarity comparisons, and a higher ILS score denotes a lower level of diversity. Other names for ILS are intra-list diversity (Vargas et al. 2014; Mauro and Ardissono 2019) or intra-list distance (Lin et al. 2020).

The generic ILS definition provides the flexibility to plug in any suitable similarity function. In the literature, a variety of similarity functions was used for different applications, using different types of content information or meta-data. Kaminskas and Bridge (2016) point out that there is no guarantee that lists with a high-ILS metric value are also perceived as highly similar. When using a particular implementation of the ILS metric, it is in principle necessary to validate that it is indeed a good proxy for user perceptions. This validation is, however, rarely done and our present research addresses this question for two ILS implementations.

We note that, besides ILS, various other metrics are also used to capture diversity in recommender research. Kunaver and Požrl (2017) provide a detailed literature review of diversity metrics. However, it turns out that several of these metrics are variations or extensions of each other, which means only a small number of distinct metrics are actually used in the literature.

An example of an approach to capture a quite different concept of diversity is to use the Gini coefficient. The Gini coefficient is a measure of distributional inequality and is, for example, used in economics to capture income inequality. In the context of recommender systems, the coefficient can be used to measure how often individual items are recommended (and subsequently more likely purchased). If a recommender system makes the same item suggestions to everyone, the distribution will be very skewed, leading to a high Gini coefficient and a concentration bias on a few items (Fleder and Hosanagar 2007)^{Footnote 2}. Differently from the ILS metric, diversity assessments based on the Gini index are not based on the analysis of individual lists, but on how often individual items appear in recommendation lists across users, which is why it is called “aggregate diversity” in Adomavicius and Kwon (2012). Questions of aggregate diversity, concentration biases, and related concepts of coverage—see Jannach et al. (2015) for an in-depth analysis—are not the focus of our present work, which aims to study human diversity perceptions at the level of individual lists.
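As a rough illustration of this aggregate view, the following sketch computes the Gini coefficient over how often each catalog item appears in the recommendation lists of all users. The function names and toy data are assumptions, and the rank-based formula is one standard way of computing the coefficient, not necessarily the formulation used in the cited works.

```python
from collections import Counter


def gini(counts):
    """Gini coefficient of non-negative counts (0 = perfectly even)."""
    xs = sorted(counts)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over the ascending-sorted values (ranks start at 1).
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


def recommendation_gini(lists, catalog):
    """Gini over recommendation frequencies; unrecommended items count as 0."""
    freq = Counter(item for lst in lists for item in lst)
    return gini([freq.get(item, 0) for item in catalog])


catalog = ["a", "b", "c", "d"]
same_for_everyone = [["a", "b"], ["a", "b"], ["a", "b"]]
spread_out = [["a", "b"], ["c", "d"]]

print(recommendation_gini(same_for_everyone, catalog))
print(recommendation_gini(spread_out, catalog))
```

Note that the input is the whole collection of lists across users; no pairwise item similarity is involved, which is what distinguishes this notion from ILS.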

In many research works on recommendation diversity, the goal is to balance the typical trade-off between accuracy (relevance) and diversity metrics, see Zheng and Wang (2022) and Jannach (2022) for related surveys on multi-objective recommender systems. An alternative to such approaches is to design evaluation metrics that combine various aspects, including accuracy, diversity, or novelty in a single metric. In the area of information retrieval, Clarke et al. (2008) for example proposed to consider various such aspects when computing the nDCG (Normalized Discounted Cumulative Gain) of an item ranking. Later on, similar ideas were proposed for recommendation problems—e.g., in Vargas (2011) and Vargas and Castells (2011)—to design relevance-aware beyond-accuracy metrics. Technically, such metrics are however again often based on the ILS metric. In our present work, we focus on human diversity perceptions independent of the individual relevance of the items for a given user.

Overall, while the review by Kunaver and Požrl (2017) shows that there are several alternative approaches to computationally assessing the different notions of diversity, the ILS is the most frequently used metric in the literature, and we therefore use it as the basis for our research.

2.2 Assessing the similarity perception of item pairs

In the context of recommender systems, similarity functions play a central role in different ways. In content-based recommendation approaches in general, similarity functions serve as the basis to assess the match between a given item and a user’s past preferences (de Gemmis et al. 2015). Also, similarity functions are commonly a main component that determines the item ranking for the problems of similar-item recommendation (Brovman et al. 2016) and next-item recommendation (Zeng et al. 2019). Moreover, similarity functions are typically the foundation of diversification approaches (Kunaver and Požrl 2017; Ziegler et al. 2005; Vargas and Castells 2011; Chen et al. 2013), where the goal is to match the users’ diversity needs or to support their exploration efforts (Tsai and Brusilovsky 2018).

For all such purposes, it is important that the chosen similarity function reflects user perception, an aspect that usually has to be validated through corresponding user studies, see, e.g., Ekstrand et al. (2014). In the following, we review works that studied user perceptions based on pairwise item similarity judgments.

In Colucci et al. (2016), participants judged the similarity of movies in pairwise comparisons using a binary judgment (yes–no). For 62 % of the evaluation pairs, there was complete consensus among the participants. Yet, these human judgments were only partly aligned with the output of three algorithmic similarity functions (with the highest precision value being only .55). Building on this dataset, Wang et al. (2017) designed two content-based recommendation approaches, where one considered human perceptions in the recommendation process whereas the other did not. Their experiments showed that users indeed preferred the recommendations that considered the human similarity judgments. The results by Colucci et al. (2016) and Wang et al. (2017) indicate that humans largely agree in their similarity perceptions of item pairs, while these perceptions are not aligned with algorithmic similarity functions. Further, their works demonstrate that different similarity functions result in discrepancies between objective similarity measures and human perception. Differently from their works, which focus on pairwise item similarities, our work focuses on the perception of entire lists and the correspondence with the ILS measure.

The movie domain was also the focus of the work by Yao and Harper (2018). In their work, the authors evaluated similarity scoring algorithms in terms of how well they reflect the users’ perceptions. To this end, study participants had to rate the similarity of movie pairs which were selected with six different similarity scoring algorithms spanning a range of activity- and content-based approaches. The results suggest that content-based approaches to defining the similarity of movies best reflect the users’ perception of similarity.

Trattner and Jannach (2019) studied in more depth how specific item features (e.g., title, plot, movie poster)—when used in a content-based similarity function—correlate with the similarity of items as perceived by users. Differently from Yao and Harper (2018), their study design required participants to compare a reference movie with a list of movies. The similarity measure based on tags reflected user perceptions well, which was also shown in Yao and Harper (2018). Moreover, capturing similarity in the latent space using matrix factorization proved to be particularly powerful as it did not only reflect user perceptions well but was also the approach that led to the highest usefulness scores in terms of the participants’ interest in trying out a movie recommendation. In our present work, we therefore rely on latent item representations when comparing items in two of our studies. Moreover, as this approach is domain-independent, it allows us to make a cross-domain comparison.

Human similarity judgments were also central to the work by Lee (2010) in the music domain. In their work, the authors collected human judgments on how “musically similar” pairs of songs were via Amazon’s Mechanical Turk platform, and they then compared those judgments with a ground truth of expert judgments. One main finding of their work was that crowdsourcing—as also done in our present study—can be considered a reliable source for music similarity judgments. Some earlier work in the music domain (Ellis et al. 2002), however, indicated that finding a computational metric that gives “reasonable agreement” with the human judgments can be challenging. And Downie et al. (2007) found that providing participants with a similarity definition (here: “musically similar” or “melodically similar”) influences the similarity judgments.

In the food and recipe domains, van Pinxteren et al. (2011) employed a card-sorting approach to identify ingredients, preparation techniques, cuisine, meal type, and preparation time as the most relevant characteristics that determine the similarity of recipes. In Trattner and Jannach (2019), recommendations based on the recipe instructions, title, and ingredient lists, as well as a combined model, led to the highest similarity perception. In one of our studies in the recipe domain, we, therefore, also rely on a combined approach where we consider ingredients and cooking instructions as item meta-data.

Finally, in the news domain, Starke et al. (2021) compared several similarity functions (based on title, body text, image features) and concluded that using the articles’ body text for capturing similarity between news articles comes closest to the human similarity perception. Yet, in this domain, the similarity functions were overall shown to be much weaker than in the recipe and the movie domain. Thus, they suggest using features other than body text only if the similarity function used is specifically adapted to the news domain. Their cross-domain comparison suggests that in terms of capturing similarity, the news domain is closer to the movie than the recipe domain, although in general the news domain may require similarity functions that are less “taste-related” than in the recipe and movie domains.

2.3 Assessing the diversity perception of item lists

While several works have addressed similarity perception, only a few have specifically addressed the diversity perception of lists. In the music domain (specifically, electronic music), for example, Porcaro et al. (2022) found that instrument and samples, sub-genre or sub-style, tempo, and mood strongly influence what track lists were considered diverse. On the artist level, the artists’ origin and nationality, gender, and skin tone were considered key factors for diverse music lists. They also found that the used metrics reflect the diversity perceptions particularly well for participants coming from Western and educated societies and in the age range between 18 and 35 years.

Although primarily studying similarity perception, Trattner and Jannach (2019) also address how similarity perception relates to diversity perception—specifically, the perception of list diversity. For both the movie and recipe domain, the item features determining perceived list diversity were found to be similar to those determining perceived similarity. For movies, the matrix factorization-based approach was a particularly useful approach to capture users’ perception of list diversity. Similarly, title, movie poster, plot, and genre were useful features in a content-based approach. In the recipe domain, the image of the dish as the basis for the similarity function reflected the users’ list diversity perception best, followed by title, instructions, and the ingredients list. While the work of Trattner and Jannach (2019) and our present work address related topics and adopt similar research methodology, there are stark differences between the two works. Trattner and Jannach (2019) aimed to find ways to construct similar-item recommendations in a reliable way. Specifically, they investigate which item features determine the perceived similarity of a given pair of items. Our work, in contrast, aims to validate if commonly used metrics for lists of items are suitable proxies for human perceptions.

Overall, the examined prior work indicates that not all similarity functions correlate equally well with the similarity perceptions of users. Thus, the findings in the literature support the hypothesis investigated in our present work that the specifics of how a similarity metric is implemented matter.

3 Experimental design

Goals of Studies We recall that the goal of our research is to assess and validate to what extent different ILS metrics correlate with the users’ perception of diversity in two popular application domains of recommender systems: movies and recipes. Further, based on the observations by Ge et al. (2011), we investigate whether the items’ order within a list impacts the users’ perceptions.

Overview of Studies Overall, we conducted four complementary studies to address these aspects. Essentially, in each of the studies, the participants were shown recommendation lists with different levels of ILS and asked to report their diversity perceptions. Specifically, we studied the users’ perceptions when (i) latent item representations were used to compare items and when (ii) application-specific similarity measures were applied. For both, we executed the studies in the domains of movies and recipes. Figure 1 shows an overview of our four studies: ${Study{\text {-}}1_{movies}}$ and ${Study{\text {-}}1_{recipes}}$ rely on latent item vectors and involve nine different list types with varying ILS levels and item orders as our manipulated variables. ${Study{\text {-}}2_{movies}}$ and ${Study{\text {-}}2_{recipes}}$ use domain-specific ILS metrics. In this second set of studies, we focus only on varying the ILS levels (low, mid, high), which leads to three list types per domain. In all studies, each list shown to users contained seven items. We considered larger list sizes to be cognitively too challenging for study participants, cf. Miller (1956), Jensen and Lisman (1996).

In our experiment, we employ a mixed design. Each participant rated three lists, which constitutes the within-subject element of the design. While all participants rated three lists, the selection of the lists was randomized and, thus, the set of lists varied across participants, which constitutes a between-subjects element of our design.

Subsidiary Research Questions While the main focus of our work is validating the ILS metric, we also designed the studies to answer relevant subsidiary questions. First, we addressed whether the familiarity of items influences the perception of a list. Similar to Porcaro et al. (2022), where domain knowledge played a role in diversity perception, participants without background knowledge regarding a specific set of items might perceive the list differently (e.g., more diverse) than participants with more knowledge. Second, we addressed whether different diversity levels impacted the participants’ decision processes, e.g., in terms of their perceived choice difficulty or choice confidence, cf. Pu et al. (2011), Ekstrand et al. (2014). Third, we investigated if the popularity of the items had any impact on the perception of the lists. As observed in other works (e.g., Abdollahpouri et al. 2019), recommender systems may face the problem of popularity bias, which could eventually alter the way users perceive the quality of the recommendations. Fourth, we checked for any gender-specific differences in the responses of the participants. Lastly, we also explored what would happen if participants were presented with only one pair of very dissimilar items, compared to presenting an entire list with high diversity.

3.1 Creating diverse recommendation lists

Creating Lists with Varied Diversity Levels To create recommendation lists of different diversity levels, we followed an approach which is both repeatable and as free as possible from potential researcher bias. The general idea of our approach is to first create a set of k recommendation lists for a randomly selected set of users from a given dataset using a matrix factorization approach.^{Footnote 3} Then, we compute the ILS values of these recommendation lists to obtain an estimate of the distribution of the ILS values for a given dataset.^{Footnote 4}

Based on this approximated distribution, we then select three of the k sampled lists, which we consider as representatives of low, medium, and high diversity. In our case, we simply used the lists with the lowest ILS and the highest ILS to serve as representatives for a low-ILS and a high-ILS list, respectively. To determine a mid-ILS representative list, we computed the mean of these highest and lowest ILS values. Then, we picked the one recommendation list from the ten samples that was closest to this mean ILS value.
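The selection rule described above can be sketched as follows. The function name `pick_representatives`, the list identifiers, and the ILS values are hypothetical; excluding the two extreme lists from the mid-ILS candidates is our reading of the procedure and should be treated as an assumption.

```python
def pick_representatives(ils_by_list):
    """Return (low, mid, high) list ids from a mapping list id -> ILS value."""
    if len(ils_by_list) < 3:
        raise ValueError("need at least three sampled lists")
    low = min(ils_by_list, key=ils_by_list.get)
    high = max(ils_by_list, key=ils_by_list.get)
    # Midpoint between the lowest and highest observed ILS values.
    target = (ils_by_list[low] + ils_by_list[high]) / 2
    # Assumption: the two extreme lists are not candidates for the mid list.
    candidates = [k for k in ils_by_list if k not in (low, high)]
    mid = min(candidates, key=lambda k: abs(ils_by_list[k] - target))
    return low, mid, high


# Hypothetical ILS values of sampled recommendation lists.
sampled = {"l1": 0.10, "l2": 0.35, "l3": 0.52, "l4": 0.90, "l5": 0.61}
print(pick_representatives(sampled))
```

Because the target is the midpoint of the observed range rather than the median of the sample, the mid-ILS list need not be the "typical" list of the distribution.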

In addition to these three lists, we designed two more special lists for our experiments for each domain.

The first of them, which we call popsim, was created by randomly picking one of the 20 most popular items in each dataset and then determining its most similar items among the 100 most popular items. We created this list to study the diversity perception of lists that contain popular similar items, which, at least in the movie domain, are also often items that users are familiar with. We note that popular item recommendations are commonly used as a simple baseline in research works. A validation of this baseline approach is therefore a particularly useful reference point for works on recommender systems algorithms.

The second list, which we named upp, serves as a proxy for an upper bound for ILS values that we may realistically observe in practice. In the movie domain, we selected a collection of sequels for this purpose (“Batman”); in the recipe domain, we manually picked a set of highly similar recipes for muffins. With upp, we created lists with a set of items with a close-to-maximum ILS value.
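The construction of a popsim-style list may be sketched as below. The function name `build_popsim`, the use of cosine similarity on latent vectors, the inclusion of the seed item in the list, and the synthetic data are all assumptions made for illustration; only the pool sizes (20 and 100) and the list length of seven come from the description in this paper.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity of two equally long vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def build_popsim(popularity, vectors, list_size=7, seed_pool=20,
                 candidate_pool=100, rng=None):
    """Pick a random top-`seed_pool` item plus its nearest popular neighbours."""
    rng = rng or random.Random()
    ranked = sorted(popularity, key=popularity.get, reverse=True)
    seed = rng.choice(ranked[:seed_pool])
    candidates = [i for i in ranked[:candidate_pool] if i != seed]
    candidates.sort(key=lambda i: cosine(vectors[seed], vectors[i]),
                    reverse=True)
    # Assumption: the seed item itself is part of the resulting list.
    return [seed] + candidates[:list_size - 1]


# Synthetic catalog: 150 items with decreasing popularity, random 8-d vectors.
data_rng = random.Random(0)
popularity = {f"item{i}": 1000 - i for i in range(150)}
vectors = {k: [data_rng.gauss(0, 1) for _ in range(8)] for k in popularity}

popsim = build_popsim(popularity, vectors, rng=random.Random(1))
print(popsim)
```

Restricting the neighbours to the 100 most popular items keeps the list within the region of the catalog that participants are most likely to know.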

This process gives us five lists with different levels for the ILS values. An overview of these lists is given in Table 1.

Table 1 Overview of how recommendation lists of increasing similarity were created
