CoDS: A Representative Sampling Method for Relational Databases

  • Teodora Sandra Buda
  • Thomas Cerqueus
  • John Murphy
  • Morten Kristiansen
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8055)


Database sampling has become a popular approach for handling large amounts of data in a wide range of application areas, such as data mining and approximate query evaluation. Using database samples is a potential solution when using the entire database is not cost-effective, and a balance between the accuracy of the results and the computational cost of the process applied to the large data set is preferred. Existing sampling approaches are limited to specific application areas, to single-table databases, or to random sampling. In this paper, we propose CoDS: a novel sampling approach targeting relational databases that ensures that the sample database follows the same distribution for specific fields as the original database. In particular, it aims to maintain the distribution between tables. We evaluate the performance of our algorithm by measuring the representativeness of the sample with respect to the original database. We compare our approach with two existing solutions, and we show that our method runs faster and produces better results in terms of representativeness.
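To make the idea of maintaining a distribution between tables concrete, the following is a minimal sketch, not the CoDS algorithm itself (the abstract does not describe its steps). It assumes a hypothetical two-table schema in which each child row references one parent row, and it uses plain stratified sampling: parents are grouped by how many children reference them, each group is sampled at the same rate, and all children of the sampled parents are kept. The function names, the tuple layout, and the `rate` and `seed` parameters are all illustrative assumptions.

```python
import random
from collections import Counter, defaultdict


def sample_preserving_fanout(parent_ids, child_rows, rate, seed=0):
    """Stratified sketch (an assumption, not CoDS): sample parents per
    fan-out group, then keep every child of each sampled parent.

    child_rows is a list of (child_id, parent_id) tuples.
    """
    rng = random.Random(seed)

    # Index children by the parent they reference.
    children_of = defaultdict(list)
    for child_id, parent_id in child_rows:
        children_of[parent_id].append(child_id)

    # Group parents by their number of children (their fan-out).
    strata = defaultdict(list)
    for pid in parent_ids:
        strata[len(children_of[pid])].append(pid)

    # Sample each group at the same rate, keeping at least one parent.
    sampled_parents = []
    for _, members in sorted(strata.items()):
        k = max(1, round(rate * len(members)))
        sampled_parents.extend(rng.sample(members, k))

    # Keeping all children of sampled parents preserves referential integrity.
    sampled_children = [(c, p) for p in sampled_parents for c in children_of[p]]
    return sampled_parents, sampled_children


def fanout_distribution(parent_ids, child_rows):
    """Fraction of parents having 0, 1, 2, ... children."""
    per_parent = Counter(p for _, p in child_rows)
    hist = Counter(per_parent.get(p, 0) for p in parent_ids)
    return {fanout: n / len(parent_ids) for fanout, n in hist.items()}
```

Comparing `fanout_distribution` on the original tables and on the sample gives a rough check of representativeness for this one relationship. A real relational schema has many tables and chains of foreign keys, which this sketch does not address.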


Keywords: Relational database; Representative database sampling





Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Teodora Sandra Buda (1)
  • Thomas Cerqueus (1)
  • John Murphy (1)
  • Morten Kristiansen (2)
  1. Lero, Performance Engineering Lab, School of Computer Science and Informatics, University College Dublin, Ireland
  2. IBM Software Group, IBM Collaboration Solutions, Dublin, Ireland
