Random sampling from database files: A survey

  • Frank Olken
  • Doron Rotem
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 420)


In this paper we survey known results on algorithms, data structures, and some applications of random sampling from databases. We first discuss various reasons for sampling from databases, and for inclusion of sampling as a DBMS operator. We consider basic sampling algorithms, sampling from trees, sampling from hash tables, and auxiliary memory resident index information to facilitate sampling.


Keywords: Internal Node; Acceptance Probability; Query Evaluation; Inclusion Probability; Disk Access
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 1990

Authors and Affiliations

  • Frank Olken (1)
  • Doron Rotem (1)
  1. Computer Science Research Dept., Information and Computing Sciences Div., Lawrence Berkeley Laboratory, Berkeley
