On Approximation Algorithms for Data Mining Applications

  • Chapter
Efficient Approximation and Online Algorithms

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 3484)

Abstract

We aim to present current trends in the theoretical computer science research on topics which have applications in data mining. We briefly describe data mining tasks in various application contexts. We give an overview of some of the questions and algorithmic issues that are of concern when mining huge amounts of data that do not fit in main memory.

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Afrati, F.N. (2006). On Approximation Algorithms for Data Mining Applications. In: Bampis, E., Jansen, K., Kenyon, C. (eds) Efficient Approximation and Online Algorithms. Lecture Notes in Computer Science, vol 3484. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11671541_1

  • DOI: https://doi.org/10.1007/11671541_1

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-32212-2

  • Online ISBN: 978-3-540-32213-9

  • eBook Packages: Computer Science, Computer Science (R0)