Abstract
We present current trends in theoretical computer science research on topics that have applications in data mining. We briefly describe data mining tasks in various application contexts, then give an overview of the questions and algorithmic issues that arise when mining data sets too large to fit in main memory.
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Afrati, F.N. (2006). On Approximation Algorithms for Data Mining Applications. In: Bampis, E., Jansen, K., Kenyon, C. (eds) Efficient Approximation and Online Algorithms. Lecture Notes in Computer Science, vol 3484. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11671541_1
DOI: https://doi.org/10.1007/11671541_1
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-32212-2
Online ISBN: 978-3-540-32213-9
eBook Packages: Computer Science, Computer Science (R0)