Abstract
The wealth of information embedded in the huge databases of corporations (e.g., retail, financial, telecom) has spurred tremendous interest in knowledge discovery and data mining. In data mining, clustering is a useful technique for discovering interesting data distributions and patterns in the underlying data. The clustering problem can be defined as follows: given n data points in a d-dimensional metric space, partition the data points into k clusters such that the data points within a cluster are more similar to each other than to data points in different clusters.
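To make the problem statement concrete, here is a minimal Python sketch of one way to partition n points into k clusters. The abstract does not fix an objective function or an algorithm, so the choices here are assumptions made for illustration: Euclidean distance as the metric, the greedy farthest-first heuristic (commonly associated with the k-center objective) as the method, and the hypothetical function name `farthest_first_clusters`. It is not presented as the chapter's own technique.

```python
import math


def farthest_first_clusters(points, k):
    """Partition points into k clusters using greedy farthest-first selection.

    Assumptions for this sketch: points are equal-length numeric tuples,
    distance is Euclidean, and the first point seeds the first cluster.
    Returns (labels, centers), where labels[i] is the cluster of point i
    and centers holds the indices of the points chosen as centers.
    """
    n = len(points)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of points")

    centers = [0]  # arbitrary seed: the first point
    # nearest[i] is the distance from point i to its closest chosen center
    nearest = [math.dist(p, points[0]) for p in points]
    labels = [0] * n

    while len(centers) < k:
        # The next center is the point farthest from all current centers
        new_center = max(range(n), key=nearest.__getitem__)
        centers.append(new_center)
        new_label = len(centers) - 1
        # Reassign every point that is now closer to the new center
        for i, p in enumerate(points):
            d = math.dist(p, points[new_center])
            if d < nearest[i]:
                nearest[i] = d
                labels[i] = new_label

    return labels, centers
```

For example, calling `farthest_first_clusters([(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)], 2)` places the three points near the origin in one cluster and the three points near (10, 10) in the other. Each round scans all n points, so the sketch costs O(nk) distance computations, which is one reason simple heuristics of this kind are often a starting point for large data sets.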
Copyright information
© 2004 Kluwer Academic Publishers
About this chapter
Cite this chapter
Guha, S., Rastogi, R., Shim, K. (2004). Techniques for Clustering Massive Data Sets. In: Clustering and Information Retrieval. Network Theory and Applications, vol 11. Springer, Boston, MA. https://doi.org/10.1007/978-1-4613-0227-8_2
DOI: https://doi.org/10.1007/978-1-4613-0227-8_2
Publisher Name: Springer, Boston, MA
Print ISBN: 978-1-4613-7949-2
Online ISBN: 978-1-4613-0227-8
eBook Packages: Springer Book Archive