Techniques for Clustering Massive Data Sets

Guha, Sudipto; Rastogi, Rajeev; Shim, Kyuseok

doi:10.1007/978-1-4613-0227-8_2

Part of the book series: Network Theory and Applications (NETA, volume 11)

Abstract

The wealth of information embedded in huge databases belonging to corporations (e.g., retail, financial, telecom) has spurred a tremendous interest in the areas of knowledge discovery and data mining. Clustering, in data mining, is a useful technique for discovering interesting data distributions and patterns in the underlying data. The problem of clustering can be defined as follows: given n data points in a d-dimensional metric space, partition the data points into k clusters such that the data points within a cluster are more similar to each other than data points in different clusters.
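The problem statement above can be made concrete with a small sketch. The abstract does not fix a distance function or an objective, so the example below assumes Euclidean distance and uses farthest-first traversal (a standard k-center heuristic) purely as an illustration; it is not presented as the chapter's own method, and the function name `farthest_first_clusters` is hypothetical.

```python
import math


def farthest_first_clusters(points, k):
    """Partition points into k clusters via farthest-first traversal.

    Illustrative assumptions: Euclidean distance, 1 <= k <= len(points),
    and the first point is used as the initial center (arbitrary choice).
    Returns (center_indices, assignment), where assignment[i] is the
    cluster number of points[i].
    """
    n = len(points)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of points")

    centers = [0]
    # dist[i] holds the distance from point i to its nearest chosen center.
    dist = [math.dist(p, points[0]) for p in points]
    assignment = [0] * n

    while len(centers) < k:
        # The next center is the point farthest from all current centers.
        nxt = max(range(n), key=lambda i: dist[i])
        centers.append(nxt)
        cluster_id = len(centers) - 1
        # Reassign any point that is strictly closer to the new center.
        for i, p in enumerate(points):
            d = math.dist(p, points[nxt])
            if d < dist[i]:
                dist[i] = d
                assignment[i] = cluster_id

    return centers, assignment


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (10, 10), (10, 11)]
    print(farthest_first_clusters(data, 2))
```

Here n is `len(points)`, d is the length of each tuple, and k is the requested number of clusters. Each point ends up in the cluster of its nearest chosen center, which is one simple way to realize "points within a cluster are more similar to each other than to points in other clusters".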

Author information

Authors and Affiliations

Department of Computer Information Sciences, University of Pennsylvania, Philadelphia, PA, 19104, USA
Sudipto Guha
Bell Laboratories, Lucent Technologies, Murray Hill, NJ, 07974, USA
Rajeev Rastogi
School of Electrical Engineering and Computer Science, Seoul National University, Kwanak P.O. Box 34, Seoul, 151-742, Korea
Kyuseok Shim

About this chapter

Cite this chapter

Guha, S., Rastogi, R., Shim, K. (2004). Techniques for Clustering Massive Data Sets. In: Clustering and Information Retrieval. Network Theory and Applications, vol 11. Springer, Boston, MA. https://doi.org/10.1007/978-1-4613-0227-8_2

DOI: https://doi.org/10.1007/978-1-4613-0227-8_2
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4613-7949-2
Online ISBN: 978-1-4613-0227-8
eBook Packages: Springer Book Archive
