Flocking-based Document Clustering on the Graphics Processing Unit

St. Charles, Jesse; Potok, Thomas E.; Patton, Robert; Cui, Xiaohui

doi:10.1007/978-3-540-78987-1_3

Part of the book series: Studies in Computational Intelligence (SCI, volume 129)

Abstract

Analyzing and grouping documents by content is a complex problem. One explored method of solving this problem borrows from nature, imitating the flocking behavior of birds. Each bird represents a single document and flies toward other documents that are similar to it. One limitation of this method of document clustering is its O(n²) complexity. As the number of documents grows, it becomes increasingly difficult to obtain results in a reasonable amount of time. However, flocking behavior, like many naturally inspired algorithms such as ant colony optimization and particle swarm optimization, is highly parallel and has seen improved performance on expensive cluster computers. In the last few years, the graphics processing unit (GPU) has received attention for its ability to solve highly parallel and semi-parallel problems much faster than the traditional sequential processor. Some applications see a large increase in performance on this new platform. The cost of these high-performance devices is also small compared with the price of cluster machines. In this paper, we exploit this architecture and apply its strengths to the document flocking problem. Our results highlight the potential benefit the GPU brings to many naturally inspired algorithms. Using the CUDA platform from NVIDIA®, we developed a document flocking implementation to run on the NVIDIA® GeForce 8800. Additionally, we developed a similar but sequential implementation of the same algorithm to run on a desktop CPU. We tested the performance of each on groups of news articles ranging in size from 200 to 3000 documents. The GPU implementation ran three to nearly five times faster than the CPU implementation. Our results also show that the two implementations are of similar complexity, indicating that the gains come from the hardware and not from algorithmic differences. This improvement in runtime makes the GPU a potentially powerful new platform for document analysis.
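
To make the O(n²) cost concrete, the following is a minimal sequential sketch of one flocking step in Python. It is not the chapter's implementation: the abstract does not give the flocking rules, so the sketch keeps only the attraction toward similar documents it describes. The names `cosine_similarity` and `flock_step`, the use of cosine similarity on term-weight vectors, and the `threshold` and `rate` parameters are all assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two term-weight vectors (assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def flock_step(positions, features, threshold=0.5, rate=0.1):
    """Move each document a fraction `rate` toward the centroid of similar documents.

    `threshold` and `rate` are hypothetical parameters, not values from the chapter.
    Returns the new positions and the number of pairwise comparisons performed.
    """
    n = len(positions)
    new_positions = []
    comparisons = 0
    for i in range(n):                 # one "bird" per document
        sum_x = sum_y = 0.0
        similar = 0
        for j in range(n):             # every bird inspects every other bird
            if i == j:
                continue
            comparisons += 1
            if cosine_similarity(features[i], features[j]) >= threshold:
                sum_x += positions[j][0]
                sum_y += positions[j][1]
                similar += 1
        x, y = positions[i]
        if similar:
            x += rate * (sum_x / similar - x)
            y += rate * (sum_y / similar - y)
        new_positions.append((x, y))
    return new_positions, comparisons


if __name__ == "__main__":
    # Four toy documents on two topics, placed at the corners of a square.
    features = [(1.0, 0.0), (1.0, 0.0), (0.0, 1.0), (0.0, 1.0)]
    positions = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0)]
    positions, comparisons = flock_step(positions, features)
    print(positions, comparisons)
```

Each step performs n(n − 1) similarity comparisons, which is the quadratic growth the abstract refers to. Every iteration of the outer loop reads only the previous positions and writes one new position, so the iterations are independent of one another; this is the kind of work that could be assigned to one GPU thread per document.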

Author information

Authors and Affiliations

University of Tennessee, Chattanooga, TN, USA
Jesse St. Charles
Oak Ridge National Laboratory, Oak Ridge, TN, USA
Thomas E. Potok, Robert Patton & Xiaohui Cui

Editor information

Editors and Affiliations

School of Computer Sciences and Information Technology, University of Nottingham, Jubilee Campus, Nottingham, NG8 1BB, UK
Natalio Krasnogor
Department of Mathematics and Computer Science, University of Catania, v.le A. Doria, 6, 95125, Catania, Italy
Giuseppe Nicosia & Mario Pavone
Department of Computer Science and Artificial Intelligence E.T.S. Ingenieria Informatica C/ Periodista Daniel Saucedo Aranda s/n, University of Granada, 18071, Granada, Spain
David Pelta

About this chapter

Cite this chapter

Charles, J.S., Potok, T.E., Patton, R., Cui, X. (2008). Flocking-based Document Clustering on the Graphics Processing Unit. In: Krasnogor, N., Nicosia, G., Pavone, M., Pelta, D. (eds) Nature Inspired Cooperative Strategies for Optimization (NICSO 2007). Studies in Computational Intelligence, vol 129. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-78987-1_3

DOI: https://doi.org/10.1007/978-3-540-78987-1_3
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-78986-4
Online ISBN: 978-3-540-78987-1
eBook Packages: Engineering (R0)