Text-Based Content Search and Retrieval in Ad-hoc P2P Communities
We consider the problem of content search and retrieval in peer-to-peer (P2P) communities. P2P computing is a potentially powerful model for information sharing between ad hoc groups of users because of its low cost of entry and natural model for resource scaling. As P2P communities grow, however, locating information distributed across the large number of peers becomes problematic. We address this problem by adapting a state-of-the-art text-based document ranking algorithm, the vector-space model instantiated with the TFxIDF ranking rule, to the P2P environment. We make three contributions: (a) we show how to approximate TFxIDF using compact summaries of individual peers’ inverted indexes rather than the inverted index of the entire communal store; (b) we develop a heuristic for adaptively determining the set of peers that should be contacted for a query; and (c) we show that our algorithm tracks TFxIDF’s performance very closely, giving P2P communities a search and retrieval algorithm as good as that possible assuming a centralized server.
KeywordsRelevant Document Retrieval Algorithm Community Size Inverted Index Inverse Document Frequency
Unable to display preview. Download preview PDF.
- 3.C. Buckley. Implementation of the SMART information retrieval system. Technical Report TR85-686, Cornell University, 1985.Google Scholar
- 4.J. P. Callan, Z. Lu, and W. B. Croft. Searching Distributed Collections with Inference Networks. In Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 21–28, 1995.Google Scholar
- 5.I. Clarke, O. Sandberg, B. Wiley, and T. W. Hong. Freenet: A distributed anonymous information storage and retrieval system. In Workshop on Design Issues in Anonymity and Unobservability, pages 46–66, 2000.Google Scholar
- 6.F.M. Cuenca-Acuna, C. Peery, R. P. Martin, and T. D. Nguyen. PlanetP: Infrastructure Support for P2P Information Sharing. Technical Report DCS-TR-465, Department of Computer Science, Rutgers University, Nov. 2001.Google Scholar
- 7.A. Demers, D. Greene, C. Hauser, W. Irish, J. Larson, S. Shenker, H. Sturgis, D. Swinehart, and D. Terry. Epidemic algorithms for replicated database maintenance. In Proceedings of the Sixth Annual ACM Symposium on Principles of Distributed Computing, pages 1–12, 1987.Google Scholar
- 8.F. Douglis, A. Feldmann, B. Krishnamurthy, and J. C. Mogul. Rate of change and other metrics: a live study of the world wide web. In USENIX Symposium on Internet Technologies and Systems, 1997.Google Scholar
- 9.J. C. French, A. L. Powell, J. P. Callan, C. L. Viles, T. Emmitt, K. J. Prey, and Y. Mou. Comparing the performance of database selection algorithms. In Research and Development in Information Retrieval, pages 238–245, 1999.Google Scholar
- 10.D. K. Gifford, P. Jouvelot, M. A. Sheldon, and J. W. O. Jr. Semantic File Systems. In Proceedings of the 13 th ACM Symposium on Operating Systems Principles, 1991.Google Scholar
- 11.Gnutella. http://gnutella.wego.com.
- 12.L. Gravano, H. Garcia-Molina, and A. Tomasic. The effectiveness of gloss for the text database discovery problem. In Proceedings of the ACM SIGMOD Conference, pages 126–137, 1994.Google Scholar
- 13.D. Harman. Overview of the first trec conference. In Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1993.Google Scholar
- 14.KaZaA. http://www.kazaa.com/.
- 15.J. Kubiatowicz, D. Bindel, Y. Chen, P. Eaton, D. Geels, R. Gummadi, S. Rhea, H. Weatherspoon, W. Weimer, C. Wells, and B. Zhao. Oceanstore: An architecture for global-scale persistent storage. In Proceedings of ACM ASPLOS, 2000.Google Scholar
- 16.Napster. http://www.napster.com.
- 17.A. Oram, editor. Peer-to-Peer: Harnessing the Power of Disruptive Technologies. O’Reilly Press, 2001.Google Scholar
- 18.S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Shenker. A scalable content addressable network. In Proceedings of the ACM SIGCOMM’ 01 Conference, 2001.Google Scholar
- 20.D. Roselli, J. Lorch, and T. Anderson. A comparison of file system workloads. In Proceedings of the 2000 USENIX Annual Technical Conference, June 2000.Google Scholar
- 21.A. Rowstron and P. Druschel. Pastry: Scalable, distributed object location and routing for large-scale peer-to-peer systems. In Proceedings of the IFIP/ACM International Conference on Distributed Systems Platforms (Middleware), 2001.Google Scholar
- 23.S. Saroiu, P. K. Gummadi, and S. D. Gribble. A measurement study of peer-to-peer file sharing systems. In Proceedings of Multimedia Computing and Networking (MMCN), 2002.Google Scholar
- 24.I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, and H. Balakrishnan. Chord: A scalable peer-to-peer lookup service for internet applications. In Proceedings of the ACM SIGCOMM’ 01 Conference, 2001.Google Scholar
- 25.I. Witten, A. Moffat, and T. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann, San Francisco, second edition, 1999.Google Scholar
- 26.B. Yang and H. Garcia-Molina. Efficient search in peer-to-peer networks. In Proceedings of the International Conference on Distributed Computing Systems (ICDCS), July 2002.Google Scholar
- 27.Y Zhao, J. Kubiatowicz, and A. Joseph. Tapestry: An infrastructure for fault-tolerant wide-area location and routing. Technical Report UCB/CSD-01-1141, University of California, Berkeley, 2000.Google Scholar