
1 Introduction

Citizen science (CS) platforms are a powerful tool that allows participants to contribute to research and increase their scientific knowledge. CS platforms also help scientists in their research projects by collecting and analyzing more data. Generally, the primary goal of CS platforms is to connect many participants, experts, and researchers around a common scientific goal. Numerous CS platforms have emerged and can be classified according to their scientific objectives: medicine, ecology, astronomy, computer science, psychology, etc. Many popular CS platforms with large communities of participants exist today, such as Zooniverse, Foldit, Eyewire, and eBird. Zooniverse benefits from the collaboration of more than 1 million registered users to analyze pictures of distant galaxies. Foldit allows users to fold the structures of selected proteins as correctly as possible by playing an online puzzle video game. Eyewire challenges players to map neurons in 3D by solving 2D puzzles, thereby helping researchers model information processing circuits. eBird collects bird observations from many volunteers to provide real-time data about bird distribution and abundance. Similarly to eBird, SPIPOLL allows users to take photos of flowering plants and their pollinating insects to study changes in pollinator assemblages across space and time. However, most existing CS platforms still lack an expert finding (EF) mechanism, which could improve the quality of the collected data and reduce data evaluation time. EF approaches aim to extract a list of experts with high knowledge and expertise in a specific domain, who can produce high-quality answers to questions from online communities. Most of these approaches have focused on community question answering (CQA) websites.
Unlike existing EF approaches, our study addresses the problem of EF in an online CS platform on biodiversity, with SPIPOLL as a case study. In SPIPOLL, after taking pictures of pollinators on flowers, users give a name to each photographed insect from 600 possibilities and share their photos and associated insect names on the platform. While users can comment on each other's observations and identifications, experts validate or correct the pollinator identifications.

In our approach, we analyze the users' comments and extract those that contain precise identifications. The extracted comments are considered as answers and are used to construct the users' social network. A weighted PageRank algorithm is then applied to the obtained network to calculate each user's expertise for a specific insect family. This paper is organized as follows: Section 2 provides an overview of related work on EF in CQA websites. Section 3 presents the general structure of the SPIPOLL website. Section 4 introduces the details of our proposed EF approach. Section 5 describes the experimental setup and the obtained results. Finally, we provide some concluding remarks in Sect. 6.

2 Related Works

CQA websites are a powerful tool for mining knowledge on specific topics that cannot be extracted easily from general web search engines. CQA websites allow online users to post and answer questions and exchange knowledge with each other. Several CQA platforms have emerged, such as Quora, Yahoo Answers, Blurtit and Stack Overflow. With the growth of these platforms, the task of EF has received significant attention in the literature. EF aims to find the appropriate users, or experts, who can provide good-quality answers to posted questions. Many research fields can benefit from EF techniques, such as question recommendation [16] and spam detection [3, 6]. For CQA websites, several approaches have been proposed, which can be classified into three main categories: 1) graph-based EF approaches, 2) content-based EF approaches and 3) competition-based EF approaches. In graph-based EF approaches, the users' network is represented by a directed graph, where nodes represent users and edges represent the relationships among them. A link from user A to user B is drawn if user B answers a question posted by user A. The expertise score of a user can be estimated from the number of edges pointing to them. Most existing works in this category have adopted link analysis algorithms such as PageRank [13] or HITS [8] to calculate the users' expertise scores. We provide in what follows a brief review of such approaches. Zhang et al. [21] proposed a new expert ranking algorithm, named ExpertiseRank. This algorithm is based on the PageRank algorithm and calculates the expertise of each user according to the expertise of the users related to them. Li et al. [9] combined document quality, document topic-focus degree and users' activities to calculate the users' expertise rank.
A social network analysis (SNA) algorithm was then used to analyze the links between the discovered experts, in order to obtain the experts for a specific topic. Zhao et al. [23] exploited the online social relations between users via a graph-regularized matrix to find experts in CQA systems. Zhao et al. [22] proposed a novel ranking metric network learning framework for EF by exploiting both the social interactions between users and the users' relative quality rank for given questions. Rafiei et al. [15] proposed a hybrid method for EF based on content analysis and SNA: the content analysis is based on a concept map, and the SNA is based on the PageRank algorithm. Wei et al. [18] proposed the ExpRank algorithm, an extension of the PageRank algorithm in which both the negative and the positive agreement relations between users are exploited to calculate their expertise. Yeneterzi et al. [20] exploited topic-relevant users and the interactions between them to construct a topic-specific authority graph, called the Topic-Candidate (TC) graph, which is used to estimate topic-specific authority scores for each user. Zhu et al. [25] exploited the information in both relevant and target categories to improve the quality of authority ranking. Procaci et al. [14] proposed a new approach for EF in online communities based on a graph ranking algorithm and an information retrieval approach, in which two machine learning techniques, an artificial neural network and a clustering algorithm, are exploited for EF. Dom et al. [5] applied a graph-based algorithm to rank email correspondents according to their degree of expertise on specific topics; their results showed that PageRank performs better than all the other tested algorithms. Shen et al. [17] used a weighted HITS algorithm to compute users' reputation and recommend the obtained experts to the users who have posted questions.
Content-based EF approaches analyze the information extracted from users' answers to predict their expertise. A user's expertise score can be estimated from their Z-score [21], their answers' quality [24], their expertise domains [7] or their answers' voted score [4]. Competition-based approaches assume that the best answerer of a question has higher expertise than the other answerers. They explore the pairwise comparisons between users (players) deduced from best-answer selections to estimate user expertise scores; each resulting pairwise comparison can be considered a two-player competition. Liu et al. [10] applied two-player competition models to determine the relative expertise scores of users. Aslya et al. [2] proposed a novel community expertise network structure, by creating relations between the best answerer and the other answerers they have beaten; the EF process is based on the principle of competition among the answerers of a question. In this work, unlike the existing graph-based EF approaches, we take into account the relationship degrees between users. We represent the interactions between users by a weighted graph, then apply a weighted PageRank algorithm on this graph to estimate the users' expertise. Details of the proposed method are described in Sect. 4.

3 The General Structure of the SPIPOLL

SPIPOLL is a CS platform created by the National Museum of Natural History (MNHN) and the Office for Insects and their Environment (Opie) to collect data on flowers and their insect pollinators within metropolitan France. The collected data improve the users' knowledge about insect pollinators and allow scientists to assess the abundance variations of pollinator communities. In SPIPOLL, each user (observer) is asked to take pictures of all the insects visiting a chosen flowering plant for a given period of time. Observers are then asked to identify the insects and the flowering plant using an online identification key. The pictures of the insects and the flowering plant from an observation session, as well as their identifications, are then uploaded to the SPIPOLL website to form a photographic collection. The SPIPOLL database currently contains more than 31329 photographic collections and 307719 insect pictures. Finally, the identifications are validated by a small group of entomologists from the Opie. In SPIPOLL, users can also comment on pictures and collections, and flag identified photos as doubtful if they are not sure about the identifications.

However, with the increase in the number of collected pictures in SPIPOLL, the limited number of current experts is insufficient to validate all the identifications. Therefore, we propose a novel approach to identify experts among the users for a specific insect family, based on the users' comments. The comments that contain precise identifications are considered as answers. Each answer is compared to the corresponding validation (the correct identification validated by experts) to verify its reliability. In other words, we know what the true identification is, and we then search for the comments that gave the right answer without ambiguity. In SPIPOLL, all data will eventually be validated with a correct identification, which is a prerequisite for ecological analysis.

4 The Proposed Approach

In our approach, we exploit both the quality of comments (answers) and the social interactions between users to predict their expertise. In our weighted graph model, users are represented as nodes connected by weighted directed edges. Each edge points from the questioner (the observer) to the answerer (the commentator). The edge weights are calculated according to the reliability of the answers exchanged between users. We consider the comments that contain a precise identification (the exact name of the insect) as answers, and the posted pictures as questions waiting for identifications (answers). An answer is considered correct if it is identical to the validation. Finally, we estimate the users' expertise by applying a weighted PageRank algorithm on the graph representing the network of questions and answers among users. Our proposed approach can be summarized as follows:

1. Merging users' comments on pictures and collections.
2. Extracting precise identifications from comments, using a text analysis technique.
3. Extracting the comments with precise identifications (CPIs).
4. Comparing the extracted CPIs with the corresponding validations (the true identifications) and calculating a score for each user and for each insect family.
5. Calculating the relationship degrees between users and constructing the users' social network graph.
6. Applying a weighted PageRank algorithm on the obtained graph and determining the expert users.

4.1 Merging Users Comments

The comments posted on collections represent 90% of all the comments on the SPIPOLL website. This is because most users prefer to add comments directly to collections rather than to individual insect pictures, as it avoids several clicks. This situation prevents us from knowing the precise pictures that users refer to in their comments. As a solution, we compare the validation of each picture belonging to a collection with the collection's comments. A comment is attributed to the corresponding picture if it contains an identification identical to the validation of one of the pictures of the collection. Comments without any identification matching a picture validation are attributed randomly to a picture of the collection that has no comment. In the end, the collection comments are merged with the picture comments. Figure 1 shows an example of the comments merging process.

Fig. 1. Example of the comments merging process. ID(CC1) and ID(CC2) represent the precise identifications contained in comments CC1 and CC2 respectively. V(P1) and V(P2) represent the validations of pictures P1 and P2 respectively.
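As an illustration, the merging procedure can be sketched in Python. The data layout used here (dictionaries with `validation`, `comments` and `identification` fields) is our own assumption for the sketch, not SPIPOLL's actual schema:

```python
import random

def merge_collection_comments(pictures, collection_comments):
    """Attribute collection-level comments to individual pictures.

    `pictures` maps picture id -> {"validation": name, "comments": [...]};
    `collection_comments` is a list of {"text": ..., "identification": name or None}.
    """
    unmatched = []
    for comment in collection_comments:
        ident = comment.get("identification")
        # Attach the comment to a picture whose validated name it contains.
        target = next((p for p in pictures.values()
                       if ident is not None and p["validation"] == ident), None)
        if target is not None:
            target["comments"].append(comment)
        else:
            unmatched.append(comment)
    # Comments matching no validation go to a random picture without comments.
    for comment in unmatched:
        empty = [p for p in pictures.values() if not p["comments"]]
        if empty:
            random.choice(empty)["comments"].append(comment)
    return pictures
```

After this step, every retained comment is tied to a single picture, so it can later be compared against that picture's validation.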

4.2 Extracting Precise Identifications from Comments

In SPIPOLL, each user can add comments on pictures or collections to greet other observers, to comment on a picture's aesthetics, or to comment on identifications. Users can also propose identifications in comments if they think the posted identifications are wrong. Usually, the identifications proposed in comments are used by observers to update their own identifications. In some cases, users propose wrong identifications, which can push observers to change their correct identifications. For this reason, comments are an important key to obtaining reliable identifications; hence, they can be used to calculate users' expertise. On the one hand, we assume that users with high expertise in a specific insect family are more likely to add comments with true and precise identifications. On the other hand, users with low expertise are more likely to add comments with wrong identifications. However, some comments contain an imprecise identification and cannot be used to judge users' answers. An identification is considered imprecise when it does not contain a term or combination of terms that corresponds unequivocally to a single insect name. On the contrary, comments with a precise identification contain a term or combination of terms that corresponds unequivocally to a single insect name, and can be defined as follows:

$$\begin{aligned} CPI=\left\{ term| \exists \ term\in Unique\_terms\right\} \end{aligned}$$
(1)

With:

term : is a comment term.

\(Unique\_terms\) : is the set of existing unique terms. To obtain this set, we apply a text analysis technique to the SPIPOLL insect names. First, we transform each insect name into a list of tokens; we then eliminate the stopwords. Note that unigram unique terms (single words) with ambiguous meanings (like brown, garden, day, etc.) have been deleted, because they are insufficient to describe the insects unambiguously.
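The construction of the unique-term set and the CPI test of Eq. (1) can be sketched as follows. The whitespace tokenizer and the stopword and ambiguous-unigram lists are placeholders for the actual French-language resources used on the SPIPOLL names:

```python
from collections import Counter

def build_unique_terms(insect_names, stopwords, ambiguous_unigrams):
    """Keep the terms (and full token combinations) that occur in exactly
    one insect name, so that matching a term identifies a single insect."""
    term_counts = Counter()
    for name in insect_names:
        tokens = [t for t in name.lower().split() if t not in stopwords]
        for tok in tokens:
            term_counts[tok] += 1
        if len(tokens) > 1:
            # The whole token combination is itself a candidate term.
            term_counts[" ".join(tokens)] += 1
    unique = {term for term, n in term_counts.items() if n == 1}
    # Drop one-word terms with ambiguous everyday meanings (brown, garden, ...).
    return unique - set(ambiguous_unigrams)

def contains_precise_identification(comment, unique_terms):
    """Eq. (1): a comment is a CPI if it contains at least one unique term."""
    text = comment.lower()
    return any(term in text for term in unique_terms)
```

For instance, with the names "honey bee" and "carpenter bee", the token "bee" appears in both names and is therefore not unique, while "honey" and "carpenter" are.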

4.3 Calculating Relationship Degree Between Users

The extracted CPIs are used to calculate the relationship degrees between users. These comments are considered as answers, and the posted pictures are considered as questions waiting for good identifications (answers). In our case, we use only the CPIs posted on pictures of insects of the same family. The relationship between two users is calculated for one target insect family, using the average scores of their exchanged answers about insects belonging to that family. The relationship strength between two users increases if they exchange good answers (i.e. if their answers are identical to the validations) and decreases if they exchange wrong answers. The difficulty of identifying an insect affects the score gained by an answer: a user earns a higher score for giving good answers about an insect that is difficult to identify, and a lower score for giving good answers about an insect that is easy to identify. Conversely, a user loses less score for giving wrong answers about a difficult insect, and loses more score for giving wrong answers about an easy insect. The length of the answer also affects the gained score, since expert users are expected to give long answers with more unique terms. The relationship degree is calculated from each user's side. Thus, we calculate the relationship degree between two users A (the commentator) and B (the observer) for a specific insect family (set of insects) f as follows:

$$\begin{aligned} {relationship}_f\left( A,B\right) =\sum _{tx\in f}{\frac{{score\_answers}_{tx}(A,B)}{\left| f\right| }} \end{aligned}$$
(2)

\(\left| f\right| \) : is the number of insects in the insect family f.

\({score\_answers}_{tx}(A,B)\) : represents the score of the answers posted by user A on the pictures of user B for a specific insect tx. This score is calculated using the following formula:

$$\begin{aligned} {score\_answers}_{tx}(A,B)\ =\frac{\sum _{R\in {Answers}_{tx}(A,B)}{\left\{ \begin{array}{c} \ \ \ \ \ \frac{1}{ease(tx)}*\left| R\right| \ \ \ \ \ \ ,\ \ \ \ \ R=V \\ -\ ease\left( tx\right) *\left| R\right| \ \ \ \ ,\ \ \ \ \ R\ne V \end{array} \right. }}{\sum _{R\in {Answers}_{tx}(A,B)}{|R|}} \end{aligned}$$
(3)

With:

ease(tx) : represents the ease score of the insect tx. This score is high when the insect is easy to identify and low when it is hard to identify. It is calculated as follows:

$$\begin{aligned} ease\left( tx\right) = \frac{Number\ of\ tx\ pictures\ with\ true\ identifications\ }{Total\ number\ of\ tx\ validated\ pictures} \end{aligned}$$
(4)

\({Answers}_{tx}(A,B)\) : is the set of answers posted by user A on the pictures of user B for the insect tx.

R : is one answer from the set of answers \({Answers}_{tx}(A,B)\).

\(\left| R\right| \) : is the length of the answer, i.e. the length of the longest unique term occurring in the answer.

\(\ V\) : is the corresponding picture validation.

In our study, each insect with an ease score higher than 0.65 (the average ease score over all insects) is considered easy to identify, while an insect with a score lower than 0.65 is considered hard to identify.
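Equations (2)-(4) can be sketched as follows. Here an answer is reduced to a pair (length of its longest unique term, correctness flag); this compact representation is our own assumption for the sketch:

```python
def ease(n_true, n_validated):
    # Eq. (4): fraction of tx pictures whose user identification was correct.
    return n_true / n_validated

def score_answers(answers, ease_tx):
    """Eq. (3): score of A's answers on B's pictures for insect tx.

    `answers` is a list of (answer_length, is_correct) pairs. Correct
    answers gain |R|/ease(tx); wrong answers lose ease(tx)*|R|; the total
    is normalized by the summed answer lengths.
    """
    num = sum((length / ease_tx) if correct else (-ease_tx * length)
              for length, correct in answers)
    den = sum(length for length, _ in answers)
    return num / den if den else 0.0

def relationship(family_answers, ease_by_taxon, family_size):
    # Eq. (2): average answer score over all insects of the family.
    return sum(score_answers(ans, ease_by_taxon[tx])
               for tx, ans in family_answers.items()) / family_size
```

Note how the asymmetry described above falls out of the formula: for a hard insect (low ease), a correct answer divides by a small number and gains much, while a wrong answer multiplies by a small number and loses little.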

4.4 Constructing the Users Social Network

When a user (observer) posts pictures on the SPIPOLL website, other users can comment on their pictures. Connecting observers to commentators by directed weighted edges from observers to commentators allows us to create the users' social network. Hence, SPIPOLL users can be organized in a weighted directed graph G(V, E), where:

V : is the set of users who share or comment pictures of one specific insect family.

E : is the set of directed edges, where \(e_{i,j}\) indicates that user \(u_j\) has commented on one or more pictures of user \(u_i\). These edges are weighted using the relationship degree formula (see Sect. 4.3).

4.5 Calculating Users Expertise Using Weighted PageRank Algorithm

The PageRank algorithm has proven its efficiency not only for ranking web pages but also in the EF field. Many PageRank-based EF algorithms [5, 15, 18, 21] have shown that PageRank outperforms other algorithms such as HITS and the Z-score [21] for EF. However, these studies have applied PageRank only to unweighted graphs. In our case, we use a weighted PageRank algorithm to extract experts from a weighted graph. Several weighted PageRank algorithms have been proposed [11, 19] to improve the performance of the original PageRank; they consist of adding weights to different parts of the PageRank formula. According to [1, 19], weighted PageRank performs better than traditional PageRank. In our approach, we use the weighted PageRank algorithm proposed by Mihalcea [12]. In this algorithm, the PageRank score of a target vertex \(V_a\) is calculated using the weights of the incoming edges from its predecessor vertices \({In(V}_a)\) and the weights of the outgoing edges of those predecessors \({Out(V}_b)\). In our approach, we calculate the weighted PageRank score of a user A as follows:

$$\begin{aligned} WP(A)=\left( 1-d\right) +d\sum _{B\ \in In(A)}{\frac{{relationship}_f\left( B,A\right) }{\sum _{C\in Out(B)}{{relationship}_f\left( B,C\right) }}*WP\left( B\right) } \end{aligned}$$
(5)

With:

\(B\ \) : is a user who has received at least one comment from user A.

In(A) : is the list of users who have received comments from user A.

C : is a user who has commented on pictures or collections of user B.

Out(B) : is the list of users who have commented on the pictures or collections of user B.

\(WP\left( B\right) \) : is the PageRank score of the user B.

d : is a damping factor that can be set between 0 and 1. As in previous studies, we set it to 0.85.
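A minimal iterative sketch of Eq. (5) follows. The edge dictionary maps (observer, commentator) pairs to relationship degrees; for simplicity this sketch assumes positive weights, since the paper does not detail how negative relationship degrees are normalized:

```python
def weighted_pagerank(weights, d=0.85, iterations=50):
    """Weighted PageRank over a directed graph, following Eq. (5).

    `weights` maps an edge (src, dst) -> relationship degree, with edges
    pointing from observers to commentators.
    """
    nodes = {u for edge in weights for u in edge}
    wp = {u: 1.0 for u in nodes}
    # Total weight of the edges leaving each node (denominator in Eq. (5)).
    out_sum = {u: 0.0 for u in nodes}
    for (src, _), w in weights.items():
        out_sum[src] += w
    for _ in range(iterations):
        new = {}
        for a in nodes:
            rank = 0.0
            for (src, dst), w in weights.items():
                if dst == a and out_sum[src] > 0:
                    # Contribution of predecessor src, scaled by its edge weight.
                    rank += w / out_sum[src] * wp[src]
            new[a] = (1 - d) + d * rank
        wp = new
    return wp
```

Running this on the SPIPOLL graph yields one score per user; sorting users by score gives the expert ranking evaluated in Sect. 5.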

5 Experiments

In this section, we evaluate the performance of our proposed approach using a set of validated pictures, observers and commentators from SPIPOLL. The collected comments were posted on insect pictures of the same family. In our study, we chose the "Apidae" insect family because it contains the most observed insects in SPIPOLL. To show the effectiveness of our proposed approach, we compare it with two state-of-the-art methods: the Z-score [21] and ExpertiseRank [21]. To generate the ground truth ranking scores, we use the set of identifications added to the pictures. For each commentator, we calculate their ground truth expertise score for a specific insect by comparing their identifications with the corresponding validations. The ground truth expertise of user \(U_n\) for the insect \({tx}_m\) is defined as follows:

$$\begin{aligned} Expertise\left( U_n,{tx}_m\right) = \frac{Number\ of\ correct\ identifications\ posted\ on\ {tx}_m\ by\ U_n}{Total\ number\ of\ identifications\ posted\ on\ {tx}_m\ by\ U_n} \end{aligned}$$
(6)

These per-insect expertise values are then used to calculate each user's ground truth expertise score for a specific insect family. The ground truth expertise of user \(U_n\) for the insect family \(f_m\) is defined as follows:

$$\begin{aligned} Expertise\left( U_n,f_m\right) =\frac{\sum _{{tx}_m\in f_m}{Expertise\left( U_n,{tx}_m\right) }}{\left| f_m\right| } \end{aligned}$$
(7)

\(\left| f_m\right| \) : is the number of existing insects in the \(f_m\) insect family.
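The ground truth computation of Eqs. (6) and (7) can be sketched as follows. We assume here that insects of the family which the user never identified simply contribute nothing to the sum, matching the division by the full family size \(\left| f_m\right| \):

```python
def expertise_for_insect(n_correct, n_total):
    # Eq. (6): fraction of the user's identifications of tx that match the validation.
    return n_correct / n_total

def expertise_for_family(per_insect, family_size):
    """Eq. (7): average per-insect expertise over the whole family.

    `per_insect` maps each insect the user identified to its
    (correct, total) identification counts.
    """
    total = sum(expertise_for_insect(c, t) for c, t in per_insect.values())
    return total / family_size
```

For example, a user who was right 1 time out of 2 on one insect and 3 times out of 3 on another, in a family of 4 insects, gets (0.5 + 1.0) / 4 = 0.375.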

5.1 Data Preparation

The dataset is a sample of the SPIPOLL database. We collected the information from all the pictures and comments posted from April 2010 to October 2017. In total, we extracted 31329 collections, 307719 pictures, 76288 comments and 1455 users. Among these comments, 28% contain precise identifications. In our case, we use only the comments posted on insect pictures of the "Apidae" family, which represent 12% of all comments. Thus, we obtain a sample containing 1844 validated pictures, 252 users and 1866 CPIs. Figure 2 shows the social network obtained from this sample. In this graph, the size of a node represents its number of connections with the other nodes: the largest nodes have a higher degree of connection than the others.

Fig. 2. The obtained social network.

5.2 Evaluation Criteria

We evaluate the performance of each algorithm under investigation using two evaluation metrics: Precision at K (P@K) and Spearman's rho. The first metric measures the proportion of best commentators (best experts) ranked in the top K results. In our evaluation, each commentator with a ground truth expertise higher than 0.4 (the average ground truth expertise over all users) is considered a best expert. The second metric measures the correlation between the ideal ranking (the ground truth ranking) and the obtained ranking. We calculate Spearman's rho for the top 10, 20 and 40 ranked commentators.
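Both metrics are straightforward to implement; a minimal sketch follows, assuming rankings are lists of users ordered from best to worst with no ties (the no-tie assumption lets us use the simplified rank-difference formula for Spearman's rho):

```python
def precision_at_k(ranked_users, true_experts, k):
    # P@K: fraction of the top-k ranked users who are ground-truth best experts.
    return sum(1 for u in ranked_users[:k] if u in true_experts) / k

def spearman_rho(ranking_a, ranking_b):
    """Spearman's rho between two tie-free rankings of the same users:
    rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1))."""
    n = len(ranking_a)
    pos_b = {u: i for i, u in enumerate(ranking_b)}
    d2 = sum((i - pos_b[u]) ** 2 for i, u in enumerate(ranking_a))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))
```

Identical rankings yield rho = 1 and exactly reversed rankings yield rho = -1, which is the scale on which Fig. 4 should be read.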

5.3 Results

Figure 3 shows the precision obtained for the top 10, 20, 30 and 40 commentators respectively. We can see that the graph-based algorithms perform better than the Z-score algorithm. This result shows that exploiting the relations among users can improve the performance of expert identification. As Fig. 3 shows, our weighted PageRank algorithm also outperforms the ExpertiseRank algorithm, especially for the top 10, 20 and 30 users. The precision of the weighted PageRank algorithm decreases when the number of users increases, and equals that of the ExpertiseRank algorithm for the top 40 users. To further measure the performance of the three algorithms, we calculated the correlation between each algorithm's ranking and the ground truth ranking. Figure 4 illustrates the statistical results in terms of Spearman's rho. From this figure, we can see that for all the algorithms, the correlation decreases when the number of users increases. This is due to the increase in the variation between the ideal ranking and the ranking obtained by each algorithm. We can also see that our weighted PageRank algorithm gives a relatively higher correlation than the other algorithms, which shows that our approach ranks experts better than they do. From the obtained results, we conclude that our weighted PageRank algorithm outperforms the other EF algorithms.

Fig. 3. Precision at top K commentators.

Fig. 4. The performance of the three algorithms in terms of Spearman's rho.

6 Conclusions

In this paper, we proposed a new graph-based EF approach for the citizen science platform SPIPOLL. This approach exploits users' comments and users' social relations to predict their expertise for a specific insect family. The relationships between users are extracted from the comments they exchange. Depending on the identification ease of the insects and the length of the comments, the relationship degree between users can increase or decrease. These relationships are used to construct a weighted graph, on which a weighted PageRank algorithm is applied to rank the users according to their expertise. We evaluated the performance of our method using a dataset from the SPIPOLL database. Experimental results showed that our method achieves better performance than state-of-the-art EF algorithms. This is due to the exploitation of the relationship degrees between users and of the weighted PageRank algorithm for calculating the users' expertise.