Approximate Object Location and Spam Filtering on Peer-to-Peer Systems
Recent work in P2P overlay networks allows for decentralized object location and routing (DOLR) across networks based on unique IDs. In this paper, we propose an extension to DOLR systems that publishes objects using generic feature vectors instead of content-hashed GUIDs, which enables these systems to locate similar objects. We discuss the design of a distributed text similarity engine, named Approximate Text Addressing (ATA), built on top of this extension, that locates objects by their text descriptions. We then outline the design and implementation of a motivating application on ATA, a decentralized spam-filtering service. We evaluate this system with 30,000 real spam email messages and 10,000 non-spam messages, and find a spam identification ratio of over 97% with zero false positives.
Keywords: Feature Vector, Feature Object, User Agent, Distributed Hash Table, Query Message
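The idea of publishing objects under feature vectors rather than a single content hash can be illustrated with a minimal sketch. The abstract does not say how the feature vectors are computed or how matches are scored, so everything below is an assumption made for illustration: the features are the smallest hashes of fixed-length substrings, a match requires a threshold number of shared features, and an in-memory dictionary stands in for the DOLR overlay. The names `feature_vector`, `ToyIndex`, `publish`, and `locate` are hypothetical and are not taken from the paper.

```python
import hashlib


def feature_vector(text, window=8, size=4):
    """Hash every `window`-character substring and keep the `size` smallest hashes.

    Assumption: this min-hash style selection is only one plausible way to
    derive a feature vector from text; it is not claimed to be the paper's.
    """
    # Normalize case and whitespace so trivial edits do not change the features.
    normalized = " ".join(text.lower().split())
    if len(normalized) < window:
        grams = {normalized}
    else:
        grams = {normalized[i:i + window]
                 for i in range(len(normalized) - window + 1)}
    hashes = {
        int.from_bytes(hashlib.sha1(g.encode("utf-8")).digest()[:8], "big")
        for g in grams
    }
    return frozenset(sorted(hashes)[:size])


class ToyIndex:
    """In-memory stand-in for an overlay: maps each feature to object IDs."""

    def __init__(self):
        self._by_feature = {}

    def publish(self, object_id, features):
        # Publish the object once under every feature in its vector.
        for feature in features:
            self._by_feature.setdefault(feature, set()).add(object_id)

    def locate(self, features, threshold):
        # Count shared features per object; report those meeting the threshold.
        counts = {}
        for feature in features:
            for object_id in self._by_feature.get(feature, ()):
                counts[object_id] = counts.get(object_id, 0) + 1
        return {oid for oid, n in counts.items() if n >= threshold}


if __name__ == "__main__":
    index = ToyIndex()
    index.publish("spam-1", feature_vector(
        "Buy cheap watches now, limited offer just for you"))
    query = feature_vector("Meeting moved to 3pm tomorrow in room 204")
    print(index.locate(query, threshold=1))
```

In a real deployment each feature would presumably be routed as a key through the overlay rather than looked up in a local dictionary, and the window length, vector size, and threshold would need tuning against real mail.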