1 Introduction

Content-based search is one of the most important functions provided by peer-to-peer (P2P) content management systems; it aims to precisely locate the files containing the desired contents. Because P2P systems are usually large-scale, heterogeneous, highly distributed, and dynamic, performing efficient and robust content-based search is challenging.

Recently, several content-based search algorithms [1–3] over structured P2P overlays have been proposed that map the shared contents to low-dimensional keys. The mapping operations adopted in these systems incur a high cost to publish and maintain the key-based indexes, especially in dynamic networks where peers join and depart frequently.

In this paper, we present a probabilistically constructed P2P network, GRACE, to support efficient, resilient content-based search with low maintenance cost. In the GRACE network, peers build their neighbor connections probabilistically according to the gradient distance between their shared documents. Compared with traditional P2P content-based search algorithms, GRACE supports efficient content-based search of text documents with relatively low maintenance cost. GRACE does not need to publish or maintain key-based indexes, which is necessary in traditional search algorithms [1–3] based on structured overlays. In our experiments, the maintenance cost of GRACE is less than 1 ‰ of that of such systems.

The rest of this paper is organized as follows. Section 2 introduces related work. Section 3 presents the model of the GRACE network. The experimental results are discussed in Sect. 4. Finally, Sect. 5 summarizes the paper.

2 Related Work

Structured overlay networks [4–6] provide a distributed infrastructure, usually called a distributed hash table (DHT), that supports O(log N) key-based search, where N is the number of peers. The topology of the network is constructed based on the globally unique ID assigned to each peer. Given any numeric key, the structured overlay can locate the peer whose ID is closest to the key within O(log N) hops.

Several recent research efforts [1–3, 7] build content-based search algorithms over DHTs. Specifically, the systems in [1, 2] map each keyword of a shared text document to a unique numeric key with a uniform hash function to support P2P full-text search. Muller et al. [3] propose a content-based image retrieval algorithm that maps the color histogram vector of each image to a fixed set of numeric keys. Tang et al. [7] map high-dimensional document vectors to low-dimensional numeric keys by using the latent semantic indexing (LSI) method. All of them need to publish index items of the form <key, element> in the network to support efficient search, which can cause the following potential problems:

  • High communication cost is required to maintain the consistency of the index in a dynamic network.

  • Uneven distribution of the shared data elements can cause load imbalance in these systems. In full-text search systems [1, 2], the Zipf-distributed keyword popularity can place an extremely heavy load on the peers that maintain the index items of popular keywords.
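To make the source of the index publishing cost concrete, the following is a minimal sketch of keyword-based index publishing over a DHT-like ring. It is not the implementation of any cited system: the 16-bit identifier space, the use of SHA-1, the successor-style assignment of keys to peers, and the names `key_of`, `responsible_peer`, and `publish` are all assumptions made for illustration.

```python
import bisect
import hashlib

ID_BITS = 16  # assumed size of the identifier space (illustrative only)


def key_of(text):
    """Map a keyword to a numeric key with a uniform hash (SHA-1 here)."""
    digest = hashlib.sha1(text.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** ID_BITS)


def responsible_peer(peer_ids, key):
    """Return the first peer ID >= key on the ring, wrapping around.

    peer_ids must be sorted. Successor-style assignment is an assumption;
    the text only requires the peer whose ID is closest to the key.
    """
    idx = bisect.bisect_left(peer_ids, key)
    return peer_ids[idx % len(peer_ids)]


def publish(peer_ids, documents):
    """Publish one <key, element> index item per distinct keyword per document."""
    index = {pid: [] for pid in peer_ids}
    for doc_id, text in documents.items():
        for keyword in set(text.lower().split()):
            key = key_of(keyword)
            index[responsible_peer(peer_ids, key)].append((key, doc_id))
    return index
```

In this sketch, every distinct keyword of every shared document produces one index item that must be sent to, and kept consistent at, a remote peer. A keyword that occurs in many documents sends all of its items to the same peer, which is where the load imbalance arises.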

3 The GRACE Network

3.1 Basic Model

The GRACE model builds connections probabilistically between pairs of peers according to the gradient distance between their shared documents. The gradient distance is a document distance with the following gradient property: given any document \( v \) in a document dataset, the number of other documents within distance \( x \) of \( v \) is proportional to \( x^{\beta} \), \( x \in (1, +\infty) \), where \( \beta \) is a positive constant.

In the GRACE model, the gradient distance between any pair of documents is defined as follows:

$$ d(v_{i}, v_{j}) = \frac{1}{\dfrac{v_{i} \cdot v_{j}}{|v_{i}|\,|v_{j}|} + \theta} \quad (0 < \theta < 1) \tag{1} $$

Here, \( \theta \) is a constant that prevents the denominator from being zero, and \( v_{i} \) and \( v_{j} \) are the content vectors of the two documents. For a text document, the content vector is defined as the term vector [8], whose elements are the frequencies of the terms appearing in the document.
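As an illustration of Eq. (1), the following minimal sketch computes the gradient distance between two text documents from their term-frequency vectors. The whitespace tokenizer, the value \( \theta = 0.5 \), and the function names are assumptions chosen for the example; the paper only requires \( 0 < \theta < 1 \).

```python
import math
from collections import Counter

THETA = 0.5  # assumed value; the paper only requires 0 < theta < 1


def term_vector(text):
    """Term-frequency vector of a text (naive whitespace tokenizer, an assumption)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity of two sparse term vectors; 0.0 if either is empty."""
    dot = sum(count * v[term] for term, count in u.items() if term in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def gradient_distance(u, v, theta=THETA):
    """Eq. (1): d(u, v) = 1 / (cos(u, v) + theta)."""
    return 1.0 / (cosine(u, v) + theta)
```

Because term frequencies are non-negative, the cosine term lies in [0, 1], so the distance ranges from \( 1/(1+\theta) \) for documents with identical term distributions to \( 1/\theta \) for documents with no terms in common.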

Based on the gradient distance defined above, any peer \( P_{i} \) in the GRACE network selects another peer \( P_{j} \) as its neighbor with the following probability:

$$ l(P_{i}, P_{j}) = \alpha \, |H(P_{i})|^{-1} \sum\limits_{v_{m} \in H(P_{i})} \; \sum\limits_{v_{n} \in H(P_{j})} d(v_{m}, v_{n})^{-\beta} \, f(v_{n})^{-1} \tag{2} $$

where \( H(P_{i}) \) and \( H(P_{j}) \) denote the document sets shared by \( P_{i} \) and \( P_{j} \), and \( d(v_{m}, v_{n}) \) is the gradient distance between two documents \( v_{m} \) and \( v_{n} \) shared by \( P_{i} \) and \( P_{j} \), respectively. \( f(v_{n}) \) is the fraction of peers in the network sharing the document \( v_{n} \), and the parameter \( \beta \) is the constant defined in the gradient property of the distance. \( \alpha \in (0, + \infty ) \) is a constant used to normalize the probability.
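Eq. (2) can be evaluated with a direct double loop, as in the sketch below. The `distance` and `doc_ratio` callables stand for \( d \) and \( f \) and are supplied by the caller. The default values of `alpha` and `beta`, and the clamping of the result to 1 so that it can be used directly as a probability, are assumptions of this sketch rather than part of the model.

```python
def connection_probability(docs_i, docs_j, distance, doc_ratio,
                           alpha=1.0, beta=1.0):
    """Eq. (2): alpha / |H(P_i)| * sum_m sum_n d(v_m, v_n)^(-beta) / f(v_n).

    docs_i, docs_j: document sets H(P_i) and H(P_j).
    distance(v_m, v_n): gradient distance between two documents.
    doc_ratio(v_n): fraction of peers sharing v_n, in (0, 1].
    """
    if not docs_i:
        return 0.0
    total = 0.0
    for v_m in docs_i:
        for v_n in docs_j:
            total += distance(v_m, v_n) ** (-beta) / doc_ratio(v_n)
    # Clamping to 1 is an assumption; the paper normalizes through alpha.
    return min(1.0, alpha * total / len(docs_i))
```

For example, with a single document on each side at distance 2, `beta=2`, `doc_ratio` returning 0.5, and `alpha=0.1`, the sum is \( 2^{-2} / 0.5 = 0.5 \) and the resulting probability is 0.05. Closer documents and rarer documents both raise the probability of a connection.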

3.2 Construction of the GRACE Network

The main task in constructing a GRACE network is to establish the neighbor relationships among online peers. A simple solution is for each peer to select its neighbors according to its connection probability, calculated by Eq. (2), with every existing peer in the network. However, this global selection strategy is impractical because of its huge communication cost. GRACE instead adopts a more efficient two-phase selection mechanism:

Phase I: Each peer selects several peers randomly from the network to form its candidate set.

Phase II: The peer selects peers from the candidate set as its neighbors. The probability for a candidate to be selected is calculated according to Eq. (2).

To implement the above phases in a distributed manner, each peer maintains four tables: Host Table, Random Seed Table, Candidate Table, and Neighbor Table.

The Host Table stores the contact information of the hosts appearing in the peer’s interaction history. For a peer joining the GRACE network for the first time, the Host Table is initialized with some stable online peers, as eMule [9] does.

The Random Seed Table consists of peers randomly selected from the Host Table and is used for candidate selection in Phase I. Each host in the table is called a random seed. The Candidate Table records the selected candidate peers.

The procedure for constructing the Neighbor Table is given in Algorithm 1. Steps 1 and 2 construct the Random Seed Table from the Host Table. Steps 3–9 construct the Candidate Table by interacting with the random seeds. The candidates selected here can be viewed as a random sample of all peers in the network. Steps 10 and 11 select the GRACE neighbors from the candidates according to the probability defined in Eq. (2). The document collections \( H(P_{i}) \) transferred in steps 4 and 5 are represented as content vectors and can be compressed to a small size in practical systems.

Algorithm 1: The joining procedure of a peer

Input: \( P_{0} \): the peer to join the network.

Output: the Neighbor Table of \( P_{0} \).

Notations:

Hosts(\( P_{i} \)): the Host Table of the peer \( P_{i} \).

Random_Seeds(\( P_{i} \)): the Random Seed Table of \( P_{i} \).

Candidates(\( P_{i} \)): the Candidate Table of \( P_{i} \).

Neighbors(\( P_{i} \)): the Neighbor Table of \( P_{i} \).

Method:

//construct \( P_{0} \)’s Random Seed Table

1. for \( \forall P_{i} \in \) Hosts(\( P_{0} \)) do

2. \( P_{0} \) inserts \( P_{i} \) into Random_Seeds(\( P_{0} \)) with probability c

//construct \( P_{0} \)’s Candidate Table

3. for \( \forall P_{i} \in \) Random_Seeds(\( P_{0} \)) do

//exchange the shared document vectors

4. \( P_{0} \) sends \( H(P_{0}) \) to \( P_{i} \)

5. \( P_{i} \) sends \( H(P_{i}) \) to \( P_{0} \)

//select all random seeds of \( P_{i} \) as candidates

6. for \( \forall P_{j} \in \) Random_Seeds(\( P_{i} \)) do

7. \( Pr_{j} \leftarrow l(P_{0}, P_{j}) \) //Eq. (2)

8. \( P_{i} \) sends \( (P_{j}, Pr_{j}) \) to \( P_{0} \)

9. \( P_{0} \) inserts \( (P_{j}, Pr_{j}) \) into Candidates(\( P_{0} \))

//select the GRACE neighbors from the candidates

10. for \( \forall (P_{j}, Pr_{j}) \in \) Candidates(\( P_{0} \)) do

11. \( P_{0} \) inserts \( P_{j} \) into Neighbors(\( P_{0} \)) with probability \( Pr_{j} \)
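The joining procedure can be simulated in a single process, as in the sketch below. It is a simplified illustration, not the authors' implementation: message exchange is replaced by direct attribute access, the `Peer` class and the `link_prob` callable (standing for Eq. (2)) are hypothetical, and skipping the joining peer itself as a candidate is an added assumption.

```python
import random


class Peer:
    """In-memory stand-in for a GRACE peer and its four tables."""

    def __init__(self, name, docs):
        self.name = name
        self.docs = docs          # H(P): content vectors shared by this peer
        self.hosts = []           # Host Table
        self.random_seeds = []    # Random Seed Table
        self.candidates = {}      # Candidate Table: peer -> probability
        self.neighbors = []       # Neighbor Table


def join(p0, link_prob, c=0.5, rng=random):
    """Build p0's Neighbor Table following the steps of Algorithm 1.

    link_prob(p0, pj) stands for Eq. (2); c is the seed-selection probability.
    """
    # Steps 1-2: sample random seeds from the Host Table.
    p0.random_seeds = [p for p in p0.hosts if rng.random() < c]

    # Steps 3-9: collect the random seeds of each seed as candidates.
    p0.candidates = {}
    for seed in p0.random_seeds:
        # Steps 4-5 (exchange of H(P0) and H(Pi)) are implicit here because
        # all peers live in the same process.
        for pj in seed.random_seeds:
            if pj is p0:
                continue  # assumption: a peer is never its own candidate
            p0.candidates[pj] = link_prob(p0, pj)

    # Steps 10-11: keep each candidate with its connection probability.
    p0.neighbors = [pj for pj, pr in p0.candidates.items()
                    if rng.random() < pr]
    return p0.neighbors
```

Passing a seeded `random.Random` instance as `rng` makes a run reproducible, which is convenient when comparing different values of `c`.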

3.3 Content-Based Search in GRACE

A simple greedy algorithm is adopted to perform content-based search in GRACE. When a peer forwards a query in the network, it selects as the next hop the peer in its Neighbor Table that shares the document with minimal distance to the query. Each visited peer independently checks its locally stored documents and returns the records of the top k documents closest to the query. The initiator of the query then gathers all returned records, ranks the documents by their distance to the query, and composes the search result from the top k documents.
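The greedy forwarding and result merging described above can be sketched as follows. The text does not specify a termination rule, so the hop limit `max_hops` and the rule that a peer is not visited twice are assumptions of this sketch, as are the dictionary-based representation of the network and the function name.

```python
import heapq


def greedy_search(start, query, neighbors, docs, distance, k=3, max_hops=5):
    """Greedy content-based search.

    neighbors: dict mapping a peer to the peers in its Neighbor Table.
    docs: dict mapping a peer to its list of shared documents.
    distance(doc, query): distance between a document and the query.
    """
    visited = set()
    records = []
    peer = start
    for _ in range(max_hops):
        visited.add(peer)
        # Each visited peer returns its local top-k documents.
        records.extend(
            heapq.nsmallest(k, docs[peer], key=lambda d: distance(d, query)))
        # Next hop: the unvisited neighbor sharing the closest document.
        unvisited = [n for n in neighbors[peer]
                     if n not in visited and docs[n]]
        if not unvisited:
            break
        peer = min(unvisited,
                   key=lambda n: min(distance(d, query) for d in docs[n]))
    # The initiator merges all records and keeps the global top k.
    return heapq.nsmallest(k, records, key=lambda d: distance(d, query))
```

In a toy network where documents are numbers and the distance is the absolute difference, a query that starts at a peer holding only distant documents is forwarded hop by hop toward the peers holding the closest ones, and the merged result contains the k closest documents seen along the path.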

4 Experiments

4.1 Data Collection in Experiments

We evaluate the experimental systems on the TREC WT10G dataset [11], which contains 1,692,096 Web documents from 11,680 Web sites. We treat each Web site as a peer and assign the documents on a site to the corresponding peer as its shared contents. We also construct a query set of 100 queries by random sampling, as in [12]. Each query is obtained by first randomly choosing a document from the dataset and then randomly choosing 5 terms from the document as the search keywords of the query. The search task is to find the top k documents most related to the query.
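The query sampling procedure can be sketched as below. The whitespace tokenizer, the sampling of distinct terms, the handling of documents with fewer than 5 distinct terms, and the fixed random seed are assumptions of this sketch and are not stated in the text.

```python
import random


def make_queries(documents, num_queries=100, terms_per_query=5, seed=0):
    """Sample queries: pick a random document, then random terms from it.

    documents: list of document texts.
    Returns a list of queries, each a list of keywords.
    """
    rng = random.Random(seed)  # fixed seed for reproducibility (assumption)
    queries = []
    for _ in range(num_queries):
        doc = rng.choice(documents)
        # Distinct terms in sorted order, so results do not depend on set order.
        terms = sorted(set(doc.split()))
        # Documents with fewer distinct terms yield shorter queries (assumption).
        queries.append(rng.sample(terms, min(terms_per_query, len(terms))))
    return queries
```

Because every query is drawn from an existing document, at least one document in the dataset contains all of its keywords.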

4.2 Comparison with Other Overlays

We compare GRACE with two algorithms described as follows:

  • K-Chord [1, 2]. K-Chord is based on Chord [5] and supports keyword-based search.

  • Rand. Connections are built randomly among peers [10], and a random-walker mechanism is adopted to route search requests.
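For reference, the request routing of a random-walker baseline can be sketched as follows. This is a generic random walk, not the exact configuration of the Rand system in [10]: the time-to-live `ttl`, the single walker, and the function name are assumptions.

```python
import random


def random_walk(start, neighbors, ttl=10, seed=None):
    """Route a request by forwarding it to a uniformly random neighbor.

    neighbors: dict mapping a peer to the list of its neighbors.
    Returns the list of peers visited, starting with the initiator.
    """
    rng = random.Random(seed)
    path = [start]
    peer = start
    for _ in range(ttl):
        if not neighbors[peer]:
            break  # dead end: the peer has no neighbors to forward to
        peer = rng.choice(neighbors[peer])
        path.append(peer)
    return path
```

Unlike the greedy forwarding in GRACE, the next hop here ignores the content shared by the neighbors, so the walk may revisit peers and gives no guarantee of approaching the documents closest to the query.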

4.3 Search Latency Analysis

To understand how the performance of GRACE is affected by the network scale, we randomly select peers from the whole dataset to form subsets of different scales. We then build the P2P networks over each subset and test the search efficiency.

Figure 1 shows the average search latency of GRACE, Rand, and K-Chord under different network scales on the WT10G dataset. GRACE is more efficient than K-Chord and Rand at locating the top k documents most similar to the query.

Fig. 1 Search latency in different networks

4.4 Communication Cost

Figure 2 illustrates the communication cost of the different systems running on the WT10G dataset. The y-axis of the figure is the communication cost measured by the number of messages. The joining cost of K-Chord is much higher than that of GRACE and Rand. Moreover, in K-Chord, the cost of joining and index maintenance accounts for the largest share. In fact, this overhead is required in most structured overlay-based systems for publishing and maintaining data indexes. Compared with these systems, the advantage of GRACE is that it supports efficient search without index publishing, so its maintenance cost can be less than 1 ‰ of that of K-Chord.

Fig. 2 Communication cost of network maintenance. Here, Join means the cost for a new peer to join the overlay. PTM means periodical topology maintenance cost. PIM means periodical index maintenance cost. Leave means the cost for a peer to leave the overlay.

5 Conclusions

GRACE, presented in this paper, constructs connections among peers probabilistically according to the gradient distance between their shared contents. GRACE provides efficient content-based search according to the similarity of shared contents without any additional key-based index. Experiments validate the search performance of GRACE and show that its maintenance cost is less than 1 ‰ of that of structured overlay-based search systems.