On Approximation Algorithms for Data Mining Applications

Afrati, Foto N.

doi:10.1007/11671541_1

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 3484)

Abstract

We present current trends in theoretical computer science research on topics that have applications in data mining. We briefly describe data mining tasks in various application contexts, and we give an overview of some of the questions and algorithmic issues that arise when mining huge amounts of data that do not fit in main memory.

Author information

Authors and Affiliations

National Technical University of Athens, Greece
Foto N. Afrati

Editor information

Editors and Affiliations

IBISC, CNRS FRE 3190, Université d’Evry Val d’Essonne, Boulevard François Mitterrand, 91025, Evry Cedex, France
Evripidis Bampis
Institute for Computer Science, University of Kiel, Olshausenstrasse 40, 24118, Kiel, Germany
Klaus Jansen
Brown University, 02918, Providence, RI, USA
Claire Kenyon

About this chapter

Cite this chapter

Afrati, F.N. (2006). On Approximation Algorithms for Data Mining Applications. In: Bampis, E., Jansen, K., Kenyon, C. (eds) Efficient Approximation and Online Algorithms. Lecture Notes in Computer Science, vol 3484. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11671541_1

DOI: https://doi.org/10.1007/11671541_1
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-32212-2
Online ISBN: 978-3-540-32213-9
eBook Packages: Computer Science, Computer Science (R0)
